[ 
https://issues.apache.org/jira/browse/NIFI-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Ethier updated NIFI-12791:
-------------------------------
    Status: Patch Available  (was: In Progress)

> ParseDocument PDF - Missing pillow-heif dependency
> --------------------------------------------------
>
>                 Key: NIFI-12791
>                 URL: https://issues.apache.org/jira/browse/NIFI-12791
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 2.0.0-M2
>            Reporter: Alex Ethier
>            Assignee: Alex Ethier
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The custom Python processor ParseDocument, when configured to parse PDFs, 
> throws the exception below because a required module is missing.
> The error message `ModuleNotFoundError: No module named 'pillow_heif'` 
> indicates that the latest version of the 'unstructured' dependency now 
> requires 'pillow_heif' to also be installed.
> Full Stacktrace:
> {code:java}
> py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
>   File "/opt/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py", line 2466, in _call_proxy
>     return_value = getattr(self.pool[obj_id], method)(*params)
>   File "/opt/nifi-2.0.0-SNAPSHOT/python/api/nifiapi/flowfiletransform.py", line 33, in transformFlowFile
>     return self.transform(self.process_context, flowfile)
>   File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 257, in transform
>     documents = self.create_docs(context, flowFile)
>   File "/opt/nifi-2.0.0-SNAPSHOT/./python/extensions/ParseDocument.py", line 225, in create_docs
>     documents = loader.load()
>   File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/unstructured.py", line 87, in load
>     elements = self._get_elements()
>   File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/langchain_community/document_loaders/pdf.py", line 57, in _get_elements
>     from unstructured.partition.pdf import partition_pdf
>   File "/opt/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/unstructured/partition/pdf.py", line 38, in <module>
>     from pillow_heif import register_heif_opener
> ModuleNotFoundError: No module named 'pillow_heif'
>       at py4j.Protocol.getReturnValue(Protocol.java:476)
>       at org.apache.nifi.py4j.client.PythonProxyInvocationHandler.invoke(PythonProxyInvocationHandler.java:64)
>       at org.apache.nifi.py4j.client.NiFiPythonGateway$1.invoke(NiFiPythonGateway.java:148)
>       at jdk.proxy29/jdk.proxy29.$Proxy179.transformFlowFile(Unknown Source)
>       at org.apache.nifi.python.processor.FlowFileTransformProxy.onTrigger(FlowFileTransformProxy.java:66)
>       at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
>       at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1274)
>       at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:244)
>       at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:102)
>       at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
>       at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
>       at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
>       at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
>       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
>       at java.base/java.lang.Thread.run(Thread.java:1583)
> {code}
> Including 'pillow-heif' in the list of required dependencies for 
> ParseDocument fixes the issue (PR forthcoming).
> Another possible fix is pinning the dependency version numbers so that new 
> upstream releases cannot introduce breaking changes.
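
Both fixes described above can be checked with a small script. The following is a minimal sketch, not the change made in the PR: the `DEPENDENCIES` list, the `IMPORT_NAMES` mapping, and the helpers `module_name`, `missing_modules`, and `unpinned` are hypothetical names introduced for illustration, and the real dependency list in ParseDocument.py is not shown in this issue. It assumes each requirement is a pip-style string and that a distribution's import name is either listed in the mapping or obtained by replacing hyphens with underscores.

```python
import importlib.util
import re

# Hypothetical, abbreviated requirement list (illustration only).
DEPENDENCIES = [
    "unstructured",  # unpinned: a newer release may start importing new modules
    "pillow-heif",   # the addition proposed in this issue
]

# Distribution names whose import name differs from the pip name.
IMPORT_NAMES = {"pillow-heif": "pillow_heif"}


def module_name(spec):
    """Map a pip requirement string to the top-level module it should provide."""
    dist = re.match(r"[A-Za-z0-9._-]+", spec).group(0)
    return IMPORT_NAMES.get(dist, dist.replace("-", "_"))


def missing_modules(dependencies):
    """Return the requirement strings whose module cannot be located."""
    missing = []
    for spec in dependencies:
        try:
            found = importlib.util.find_spec(module_name(spec)) is not None
        except ModuleNotFoundError:
            # Raised for dotted names when the parent package is absent.
            found = False
        if not found:
            missing.append(spec)
    return missing


def unpinned(dependencies):
    """Return the requirement strings that lack an exact '==' version pin."""
    return [spec for spec in dependencies if "==" not in spec]
```

Calling `missing_modules(DEPENDENCIES)` inside the processor's Python environment would list 'pillow-heif' until it is installed, and `unpinned(DEPENDENCIES)` lists the entries that remain exposed to breaking upstream releases.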



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
