Gabor

Thanks. While I understand the logical grouping (*these all do doc parsing
things*), why is it important for them to be in the same package? Why not
have separate document parsing packages, each of which can be built and
deployed separately?

Thanks

On Tue, Sep 24, 2024 at 9:29 AM Gábor Gyimesi <lordga...@apache.org> wrote:

> David, Joe,
>
> You are right, it's easier to understand such a use case with an
> example. We currently have a ParseDocument processor in our python
> extensions with PLAIN_TEXT, HTML, MARKDOWN, PDF, WORD, EXCEL,
> POWERPOINT input format support, using the unstructured library on its
> own or through langchain. The unstructured library has support for
> several input formats and depending on that format it provides
> additional support in extensions like unstructured[pdf],
> unstructured[csv] and so on.
>
> This ParseDocument could be quite cumbersome on its own, so let's say
> we would like to create separate processors for each format. We would
> create a "parse_document" package with a separate processor for
> each document format, like ParseText, ParsePdf, ParseCSV, and we would
> like to install only the format-specific unstructured package
> extension for each processor. In this case ParseText would only require
> the base unstructured pip package, but ParsePdf would require
> unstructured[pdf], which also has some very large transitive
> dependencies like nvidia cuda packages and pytorch. The
> unstructured[pdf] package installs almost 6GB of dependencies in the
> latest release, which is unnecessary if we only want to use a much
> more lightweight processor from the same package, such as ParseText.
>
> But this is just one example. I think there could be other use cases
> as well, where processors that logically belong in the same package
> have different dependencies, and those dependencies should not be
> installed in a processor-specific virtual environment if they are not
> used.
>
> Regards,
> Gabor Gyimesi
>
> On Tue, 24 Sept 2024 at 16:52, David Handermann
> <exceptionfact...@apache.org> wrote:
> >
> > Gabor,
> >
> > On a similar note, it would be helpful to provide a concrete example.
> >
> > Unlike Java NARs, Python Processors do not have the same concept of
> > multiple layers of parent class loaders right now. Virtual
> > environments provide dependency isolation, but there isn't the same
> > concept of sharing dependencies. Attempting to implement something
> > similar to the NAR hierarchy for Python Processors presents some
> > important questions that would have to be addressed.
> >
> > Having a concrete example against which to consider these complexities
> > would be a helpful way to evaluate whether it makes sense to introduce
> > additional dependency loading concepts for Python.
> >
> > Regards,
> > David Handermann
> >
> > On Tue, Sep 24, 2024 at 9:41 AM Joe Witt <joe.w...@gmail.com> wrote:
> > >
> > > Gabor
> > >
> > > > Can you please describe a specific case or cases where ProcessorA and
> > > > ProcessorB should be in the same package/module and yet have such
> > > > vastly different (100s of MB or even GB) dependency requirements?
> > >
> > > Thanks
> > > Joe
> > >
> > > > On Tue, Sep 24, 2024 at 7:32 AM Ferenc Gerlits <fgerl...@apache.org> wrote:
> > >
> > > > Hi Gabor,
> > > >
> > > > I like this approach, and I think the restriction you propose (that
> > > > all utility files in the package use the same dependencies, and extra
> > > > dependencies for processor A are only used in ProcessorA.py) is
> > > > reasonable.  I would be happy to implement this if there are no
> > > > objections.
> > > >
> > > > Thanks,
> > > > Ferenc
> > > >
>
