Gabor

Thanks. While I understand the logical grouping (*these all do doc parsing things*), why is it important for them to be in the same package? Why not have separate document parsing packages, each of which can be built/deployed separately?

Thanks

On Tue, Sep 24, 2024 at 9:29 AM Gábor Gyimesi <lordga...@apache.org> wrote:

> David, Joe,
>
> You are right, it's easier to understand such a use case with an
> example. We currently have a ParseDocument processor in our python
> extensions with PLAIN_TEXT, HTML, MARKDOWN, PDF, WORD, EXCEL,
> POWERPOINT input format support, using the unstructured library on its
> own or through langchain. The unstructured library has support for
> several input formats and depending on that format it provides
> additional support in extensions like unstructured[pdf],
> unstructured[csv] and so on.
>
> This ParseDocument could be quite cumbersome on its own so let's say
> we would like to create separate processors for each format. We would
> create a "parse_document" package and have separate processors for
> each document format like ParseText, ParsePdf, ParseCSV, then we would
> like to install only the format specific unstructured package
> extension for a processor. In this case a ParseText would only require
> the base unstructured pip package, but for ParsePdf that would require
> unstructured[pdf] which also has some very large transitive
> dependencies like nvidia cuda packages and pytorch. The
> unstructured[pdf] package installs almost 6GB of dependencies in the
> latest release, which is unnecessary if we only want to use a much
> more lightweight processor from the same package like a ParseText
> processor for example.
>
> But this is just one example, I think there could be other use cases
> as well, when the same package that contains logically inseparable
> processors could have different dependencies and should not be
> installed in a processor specific virtual environment if they are not
> used.
>
> Regards,
> Gabor Gyimesi
>
> On Tue, 24 Sept 2024 at 16:52, David Handermann
> <exceptionfact...@apache.org> wrote:
> >
> > Gabor,
> >
> > On a similar note, it would be helpful to provide a concrete example.
> >
> > Unlike Java NARs, Python Processors do not have the same concept of
> > multiple layers of parent class loaders right now. Virtual
> > environments provide dependency sharing, but there isn't the same
> > concept of sharing dependencies. Attempting to implement something
> > similar to NAR hierarchy for Python Processors presents some important
> > questions that would have to be addressed.
> >
> > Having a concrete example where to consider these complexities would
> > be a helpful way to evaluate whether it makes sense to introduce
> > additional dependency loading concepts for Python.
> >
> > Regards,
> > David Handermann
> >
> > On Tue, Sep 24, 2024 at 9:41 AM Joe Witt <joe.w...@gmail.com> wrote:
> > >
> > > Gabor
> > >
> > > Can you please describe a specific case or cases where ProcessorA and
> > > ProcessorB should be in the same package/module and yet have such vastly
> > > different (100s of MB or even GB) of dependency requirements?
> > >
> > > Thanks
> > > Joe
> > >
> > > On Tue, Sep 24, 2024 at 7:32 AM Ferenc Gerlits <fgerl...@apache.org> wrote:
> > > >
> > > > Hi Gabor,
> > > >
> > > > I like this approach, and I think the restriction you propose (that
> > > > all utility files in the package use the same dependencies, and extra
> > > > dependencies for processor A are only used in ProcessorA.py) is
> > > > reasonable. I would be happy to implement this if there are no
> > > > objections.
> > > >
> > > > Thanks,
> > > > Ferenc
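
For reference, the per-processor dependency idea in Gabor's ParseText / ParsePdf example can be sketched in plain Python. This is only an illustration under stated assumptions: `PACKAGE_REQUIREMENTS`, `PROCESSOR_REQUIREMENTS` and `requirements_for` are hypothetical names, not part of NiFi or the unstructured library, and the sketch assumes each processor declares its own pip requirements on top of a list shared by the whole package.

```python
# Hypothetical sketch: these names are illustrative, not a NiFi API.

# Requirements shared by every processor in the "parse_document" package.
PACKAGE_REQUIREMENTS = ["unstructured"]

# Extra requirements declared per processor (assumed layout).
PROCESSOR_REQUIREMENTS = {
    "ParseText": [],
    "ParsePdf": ["unstructured[pdf]"],
    "ParseCSV": ["unstructured[csv]"],
}


def requirements_for(processors):
    """Return the pip requirements needed for only the processors in use."""
    needed = list(PACKAGE_REQUIREMENTS)
    for name in processors:
        for req in PROCESSOR_REQUIREMENTS[name]:
            # Keep declaration order and skip duplicates.
            if req not in needed:
                needed.append(req)
    return needed


if __name__ == "__main__":
    print(requirements_for(["ParseText"]))
    print(requirements_for(["ParseText", "ParsePdf"]))
```

Under this layout, an environment that only uses ParseText would need just the base package, and the large unstructured[pdf] dependency set would be installed only once ParsePdf is actually in use.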