Alex Ethier created NIFI-12791:
----------------------------------

             Summary: ParseDocument PDF - Missing Pillow dependency
                 Key: NIFI-12791
                 URL: https://issues.apache.org/jira/browse/NIFI-12791
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
    Affects Versions: 2.0.0-M2
            Reporter: Alex Ethier
            Assignee: Alex Ethier


The custom Python processor ParseDocument, when configured to parse PDFs, throws 
the exception below because of a missing module.

The error message `ModuleNotFoundError: No module named 'pillow_heif'` indicates 
that the latest version of the unstructured dependency now requires 'pillow_heif' 
to be installed.

Full Stacktrace:
{code:java}
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/opt/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py", line 2466, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/opt/nifi-2.0.0-SNAPSHOT/python/api/nifiapi/flowfiletransform.py", line 33, in transformFlowFile
    return self.transform(self.process_context, flowfile)
  File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 257, in transform
    documents = self.create_docs(context, flowFile)
  File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 225, in create_docs
    documents = loader.load()
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/unstructured.py", line 87, in load
    elements = self._get_elements()
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/pdf.py", line 57, in _get_elements
    from unstructured.partition.pdf import partition_pdf
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/unstructured/partition/pdf.py", line 38, in <module>
    from pillow_heif import register_heif_opener
ModuleNotFoundError: No module named 'pillow_heif'


        at py4j.Protocol.getReturnValue(Protocol.java:476)
        at org.apache.nifi.py4j.client.PythonProxyInvocationHandler.invoke(PythonProxyInvocationHandler.java:64)
        at org.apache.nifi.py4j.client.NiFiPythonGateway$1.invoke(NiFiPythonGateway.java:148)
        at jdk.proxy29/jdk.proxy29.$Proxy179.transformFlowFile(Unknown Source)
        at org.apache.nifi.python.processor.FlowFileTransformProxy.onTrigger(FlowFileTransformProxy.java:66)
        at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
        at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1274)
        at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:244)
        at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:102)
        at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
        at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
{code}
Including 'pillow-heif' in the list of required dependencies for ParseDocument 
fixes the issue (PR forthcoming).
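
For illustration only, a minimal sketch of the dependency fix and a quick way to check 
it. The `ProcessorDetails` class below is a hypothetical stand-in, not the actual 
contents of ParseDocument.py, and its dependency list is abbreviated. The 
`IMPORT_NAMES` mapping and the `missing_modules` helper are likewise assumptions 
introduced for this example.

```python
from importlib.util import find_spec


# Hypothetical stand-in for the processor's dependency declaration; the real
# list lives in ParseDocument.py and is not reproduced here.
class ProcessorDetails:
    dependencies = ["langchain", "unstructured", "pillow-heif"]  # 'pillow-heif' is the addition


# Assumed mapping from pip distribution name to importable module name,
# for the cases where the two differ.
IMPORT_NAMES = {"pillow-heif": "pillow_heif"}


def missing_modules(distributions):
    """Return the distributions whose top-level module cannot be located."""
    missing = []
    for dist in distributions:
        module = IMPORT_NAMES.get(dist, dist.replace("-", "_"))
        try:
            found = find_spec(module) is not None
        except (ImportError, ValueError):
            # find_spec can raise instead of returning None in some edge cases.
            found = False
        if not found:
            missing.append(dist)
    return missing


if __name__ == "__main__":
    # Prints the declared dependencies that are not importable in this environment.
    print(missing_modules(ProcessorDetails.dependencies))
```

Run inside the processor's Python environment, an empty list would suggest that 
every declared dependency, including 'pillow-heif', can be imported.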

Another possible fix is locking the version numbers to prevent dependencies 
from causing breaking updates.
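
As a rough sketch of the version-locking idea, the helper below flags requirement 
strings that do not pin an exact version with `==`. The function name `unpinned` 
and the example requirement strings are hypothetical, and the version number shown 
is made up for illustration; it is not a recommended pin.

```python
def unpinned(requirements):
    """Return the requirement strings that do not pin an exact version with '=='."""
    return [req for req in requirements if "==" not in req]


# Hypothetical requirement strings; the pinned version is illustrative only.
example = ["unstructured", "pillow-heif", "example-package==1.2.3"]

print(unpinned(example))  # ['unstructured', 'pillow-heif']
```

Any entry reported by such a check can pull in a newer release at install time, 
which is how an upstream change like the new 'pillow_heif' import can reach the 
processor without any change on the NiFi side.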



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
