David, Joe,

You are right, such a use case is easier to understand with an
example. We currently have a ParseDocument processor in our Python
extensions that supports the PLAIN_TEXT, HTML, MARKDOWN, PDF, WORD,
EXCEL and POWERPOINT input formats, using the unstructured library
either on its own or through langchain. The unstructured library
supports several input formats, and the additional dependencies
needed for a given format are provided by extras such as
unstructured[pdf], unstructured[csv] and so on.

This ParseDocument processor could be quite cumbersome on its own, so
let's say we would like to create separate processors for each format.
We would create a "parse_document" package with one processor per
document format, such as ParseText, ParsePdf and ParseCSV, and each
processor should install only the unstructured extra for its own
format. In this case ParseText would only require the base
unstructured pip package, while ParsePdf would require
unstructured[pdf], which also has some very large transitive
dependencies such as the NVIDIA CUDA packages and PyTorch. The
unstructured[pdf] package installs almost 6GB of dependencies in the
latest release, which is unnecessary if we only want to use a much
more lightweight processor from the same package, such as ParseText.
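
To make the idea more concrete, here is a rough sketch of how
per-processor dependencies could be collected. It is only an
illustration, not a proposed implementation: the plain classes stand
in for real processors so the snippet runs without NiFi, and
requirements_for and PACKAGE_DEPENDENCIES are hypothetical names I
made up for this mail.

```python
# Hypothetical sketch: plain classes stand in for NiFi Python processors
# so the example can run without NiFi installed.


class ParseText:
    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        # Only the base package is needed for plain text.
        dependencies = ["unstructured"]


class ParsePdf:
    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        # The pdf extra pulls in the large transitive dependencies.
        dependencies = ["unstructured[pdf]"]


class ParseCSV:
    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        dependencies = ["unstructured[csv]"]


# Assumed: requirements shared by every utility file in the package.
PACKAGE_DEPENDENCIES = []


def requirements_for(processor_class, package_dependencies=()):
    """Return the pip requirements for one processor's virtual environment.

    Package-wide requirements come first, followed by the requirements
    declared by the processor itself. Duplicates are dropped while the
    original order is preserved.
    """
    details = getattr(processor_class, "ProcessorDetails", None)
    own = list(getattr(details, "dependencies", []))
    combined = []
    for requirement in list(package_dependencies) + own:
        if requirement not in combined:
            combined.append(requirement)
    return combined


if __name__ == "__main__":
    for processor in (ParseText, ParsePdf, ParseCSV):
        print(processor.__name__, requirements_for(processor, PACKAGE_DEPENDENCIES))
```

With something like this, the virtual environment created for
ParseText would never see unstructured[pdf], while the utility files
of the package could still rely on the shared requirements.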

But this is just one example. I think there could be other use cases
as well, where processors that logically belong in the same package
have different dependencies, and a processor-specific virtual
environment should not install dependencies that the processor does
not use.

Regards,
Gabor Gyimesi

On Tue, 24 Sept 2024 at 16:52, David Handermann
<exceptionfact...@apache.org> wrote:
>
> Gabor,
>
> On a similar note, it would be helpful to provide a concrete example.
>
> Unlike Java NARs, Python Processors do not have the same concept of
> multiple layers of parent class loaders right now. Virtual
> environments provide dependency isolation, but there isn't the same
> concept of sharing dependencies. Attempting to implement something
> similar to NAR hierarchy for Python Processors presents some important
> questions that would have to be addressed.
>
> Having a concrete example in which to consider these complexities
> would be a helpful way to evaluate whether it makes sense to introduce
> additional dependency loading concepts for Python.
>
> Regards,
> David Handermann
>
> On Tue, Sep 24, 2024 at 9:41 AM Joe Witt <joe.w...@gmail.com> wrote:
> >
> > Gabor
> >
> > Can you please describe a specific case or cases where ProcessorA and
> > ProcessorB should be in the same package/module and yet have such vastly
> > different (100s of MB or even GB) dependency requirements?
> >
> > Thanks
> > Joe
> >
> > On Tue, Sep 24, 2024 at 7:32 AM Ferenc Gerlits <fgerl...@apache.org> wrote:
> >
> > > Hi Gabor,
> > >
> > > I like this approach, and I think the restriction you propose (that
> > > all utility files in the package use the same dependencies, and extra
> > > dependencies for processor A are only used in ProcessorA.py) is
> > > reasonable.  I would be happy to implement this if there are no
> > > objections.
> > >
> > > Thanks,
> > > Ferenc
> > >
