Alex Ethier created NIFI-12791:
----------------------------------

Summary: ParseDocument PDF - Missing Pillow dependency
Key: NIFI-12791
URL: https://issues.apache.org/jira/browse/NIFI-12791
Project: Apache NiFi
Issue Type: Bug
Components: Extensions
Affects Versions: 2.0.0-M2
Reporter: Alex Ethier
Assignee: Alex Ethier

The custom Python processor ParseDocument, when configured to parse PDFs, throws the exception below because a required module is missing. The error message `ModuleNotFoundError: No module named 'pillow_heif'` indicates that the latest version of the unstructured dependency now requires 'pillow_heif' to be installed.

Full Stacktrace:
{code:java}
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/opt/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py", line 2466, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/opt/nifi-2.0.0-SNAPSHOT/python/api/nifiapi/flowfiletransform.py", line 33, in transformFlowFile
    return self.transform(self.process_context, flowfile)
  File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 257, in transform
    documents = self.create_docs(context, flowFile)
  File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 225, in create_docs
    documents = loader.load()
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/unstructured.py", line 87, in load
    elements = self._get_elements()
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/pdf.py", line 57, in _get_elements
    from unstructured.partition.pdf import partition_pdf
  File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/unstructured/partition/pdf.py", line 38, in <module>
    from pillow_heif import register_heif_opener
ModuleNotFoundError: No module named 'pillow_heif'

	at py4j.Protocol.getReturnValue(Protocol.java:476)
	at org.apache.nifi.py4j.client.PythonProxyInvocationHandler.invoke(PythonProxyInvocationHandler.java:64)
	at org.apache.nifi.py4j.client.NiFiPythonGateway$1.invoke(NiFiPythonGateway.java:148)
	at jdk.proxy29/jdk.proxy29.$Proxy179.transformFlowFile(Unknown Source)
	at org.apache.nifi.python.processor.FlowFileTransformProxy.onTrigger(FlowFileTransformProxy.java:66)
	at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
	at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1274)
	at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:244)
	at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:102)
	at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
{code}

Including 'pillow-heif' in the list of required dependencies for ParseDocument fixes the issue (PR forthcoming). Another possible fix is pinning the dependency version numbers so that new upstream releases cannot introduce breaking changes.
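
To check whether an environment is affected before the fix lands, a small diagnostic along the following lines may help. It is only a sketch: the helper name `missing_modules` is hypothetical, and it assumes the script is run with the same Python environment that NiFi uses for the ParseDocument extension. Note that the pip package is named 'pillow-heif' while the import name in the traceback is 'pillow_heif'; the check uses the import name.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be located for import.

    Hypothetical helper for illustration; it only looks the modules up and
    does not import or install anything.
    """
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'pillow_heif' is the module named in the ModuleNotFoundError above;
    # 'json' is a standard-library module included as a control.
    print(missing_modules(["json", "pillow_heif"]))
```

If 'pillow_heif' appears in the printed list, the environment lacks the module that the failing import in unstructured/partition/pdf.py expects.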