Re: [DISCUSS] Python processor dependency management

Gábor Gyimesi Tue, 24 Sep 2024 10:13:29 -0700

Joe,

In this scenario the processors cover very similar use cases, so their
code would be very similar too: probably similar properties and similar
functions used by all of them. After some refactoring, that shared code
would become common utility functions and maybe some common base
classes, extracted into common modules for these processors. That makes
it convenient for all of them to reside in the same package; otherwise
we would end up with a lot of code duplication.
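
To make that layout a bit more concrete, here is a rough sketch of what I
have in mind. It is only an illustration: every name in it
(ParseDocumentBase, COMMON_DEPENDENCIES, extra_dependencies,
requirements_for) is hypothetical and not part of the NiFi Python API, and
it only models how the requirement lists could be combined; it does not
install anything.

```python
# Hypothetical sketch: none of these names come from the NiFi Python API.

# Requirements shared by every processor and utility module in the package.
COMMON_DEPENDENCIES = ("unstructured",)


class ParseDocumentBase:
    """Shared base class that would live in a common module of the package."""

    # Extra pip requirements used only by one concrete processor.
    extra_dependencies = ()

    @classmethod
    def requirements(cls):
        """Return common requirements followed by this processor's extras."""
        merged = []
        for requirement in COMMON_DEPENDENCIES + tuple(cls.extra_dependencies):
            if requirement not in merged:
                merged.append(requirement)
        return merged

    def normalize(self, text):
        """Shared utility: collapse runs of whitespace into single spaces."""
        return " ".join(text.split())


class ParseText(ParseDocumentBase):
    # Needs nothing beyond the common requirements.
    pass


class ParsePdf(ParseDocumentBase):
    extra_dependencies = ("unstructured[pdf]",)


class ParseCSV(ParseDocumentBase):
    extra_dependencies = ("unstructured[csv]",)


def requirements_for(processors):
    """Union of requirements for only the processors that are actually used."""
    merged = []
    for processor in processors:
        for requirement in processor.requirements():
            if requirement not in merged:
                merged.append(requirement)
    return merged


if __name__ == "__main__":
    # A flow that uses only ParseText and ParseCSV never pulls in the PDF extras.
    print(requirements_for([ParseText, ParseCSV]))
```

The point of the sketch is that the shared code stays in one place, while
a flow that only uses ParseText never has to pull in what ParsePdf needs.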


Regards,
Gabor

On Tue, 24 Sept 2024 at 18:33, Joe Witt <joe.w...@gmail.com> wrote:
>
> Gabor
>
> Thanks.  While I understand the logical grouping *these all do doc parsing
> things* why is it important for them to be in the same package? Why not
> have separate document parsing packages each which can be built/deployed
> separately?
>
> Thanks
>
> On Tue, Sep 24, 2024 at 9:29 AM Gábor Gyimesi <lordga...@apache.org> wrote:
>
> > David, Joe,
> >
> > You are right, it's easier to understand such a use case with an
> > example. We currently have a ParseDocument processor in our python
> > extensions with PLAIN_TEXT, HTML, MARKDOWN, PDF, WORD, EXCEL,
> > POWERPOINT input format support, using the unstructured library on its
> > own or through langchain. The unstructured library has support for
> > several input formats and depending on that format it provides
> > additional support in extensions like unstructured[pdf],
> > unstructured[csv] and so on.
> >
> > This ParseDocument could be quite cumbersome on its own so let's say
> > we would like to create separate processors for each format. We would
> > create a "parse_document" package and have separate processors for
> > each document format like ParseText, ParsePdf, ParseCSV, then we would
> > like to install only the format specific unstructured package
> > extension for a processor. In this case a ParseText would only require
> > the base unstructured pip package, but for ParsePdf that would require
> > unstructured[pdf] which also has some very large transitive
> > dependencies like nvidia cuda packages and pytorch. The
> > unstructured[pdf] package installs almost 6GB of dependencies in the
> > latest release, which is unnecessary if we only want to use a much
> > more lightweight processor from the same package like a ParseText
> > processor for example.
> >
> > But this is just one example. I think there could be other use cases
> > as well, where a package contains logically inseparable processors
> > with different dependencies, and those dependencies should not be
> > installed in a processor-specific virtual environment if that
> > processor does not use them.
> >
> > Regards,
> > Gabor Gyimesi
> >
> > On Tue, 24 Sept 2024 at 16:52, David Handermann
> > <exceptionfact...@apache.org> wrote:
> > >
> > > Gabor,
> > >
> > > On a similar note, it would be helpful to provide a concrete example.
> > >
> > > Unlike Java NARs, Python Processors do not have the same concept of
> > > multiple layers of parent class loaders right now. Virtual
> > > environments provide dependency isolation, but there isn't the same
> > > concept of sharing dependencies. Attempting to implement something
> > > similar to NAR hierarchy for Python Processors presents some important
> > > questions that would have to be addressed.
> > >
> > > Having a concrete example in which to consider these complexities would
> > > be a helpful way to evaluate whether it makes sense to introduce
> > > additional dependency loading concepts for Python.
> > >
> > > Regards,
> > > David Handermann
> > >
> > > On Tue, Sep 24, 2024 at 9:41 AM Joe Witt <joe.w...@gmail.com> wrote:
> > > >
> > > > Gabor
> > > >
> > > > Can you please describe a specific case or cases where ProcessorA and
> > > > ProcessorB should be in the same package/module and yet have such
> > > > vastly different (100s of MB or even GB) dependency requirements?
> > > >
> > > > Thanks
> > > > Joe
> > > >
> > > > On Tue, Sep 24, 2024 at 7:32 AM Ferenc Gerlits <fgerl...@apache.org> wrote:
> > > >
> > > > > Hi Gabor,
> > > > >
> > > > > I like this approach, and I think the restriction you propose (that
> > > > > all utility files in the package use the same dependencies, and extra
> > > > > dependencies for processor A are only used in ProcessorA.py) is
> > > > > reasonable.  I would be happy to implement this if there are no
> > > > > objections.
> > > > >
> > > > > Thanks,
> > > > > Ferenc
> > > > >
> >
