Pierre Villard created NIFI-12619:
-------------------------------------
Summary: Unable to instantiate ParseDocument processor
Key: NIFI-12619
URL: https://issues.apache.org/jira/browse/NIFI-12619
Project: Apache NiFi
Issue Type: Bug
Components: Extensions
Affects Versions: 2.0.0-M1
Environment: Python 3.11.6
Reporter: Pierre Villard
Assignee: Pierre Villard
When I try to instantiate the Python processor ParseDocument, I get this error:
{code:java}
2024-01-16 18:00:00,573 INFO org.apache.nifi.py4j.ExtensionManager Importing
dependencies ['langchain', 'unstructured', 'unstructured-inference',
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image',
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'] for
ParseDocument to
/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT
using command
['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
'-m', 'pip', 'install', '--no-cache-dir', '--target',
'/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
'langchain', 'unstructured', 'unstructured-inference',
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image',
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx']
2024-01-16 18:00:15,752 ERROR py4j.java_gateway There was an exception while
executing the Python Proxy on the Python Side.
Traceback (most recent call last):
File
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py",
line 2466, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./python/framework/Controller.py",
line 72, in downloadDependencies
self.extensionManager.import_external_dependencies(processor_details,
work_dir)
File
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/ExtensionManager.py",
line 511, in import_external_dependencies
raise RuntimeError(f"Failed to import requirements for {class_name}:
process exited with status code {result}")
RuntimeError: Failed to import requirements for ParseDocument: process exited
with status code
CompletedProcess(args=['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
'-m', 'pip', 'install', '--no-cache-dir', '--target',
'/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
'langchain', 'unstructured', 'unstructured-inference',
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image',
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'], returncode=1)
{code}
When I run the pip command manually, I get:
{code:java}
no matches found: pdfminer.six[image] {code}
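A note on the manual reproduction: "no matches found" reads like a shell glob failure on the unquoted square brackets (zsh reports it this way) rather than an error from pip. That is an assumption, since the shell used is not stated here. It would also only affect the manual run: the log above shows the framework passing the command as an argument list, which normally bypasses the shell. A minimal sketch of building a copy-pasteable command with every argument quoted (the helper name and the target path are hypothetical):

```python
import shlex


def build_pip_command(requirements, target_dir):
    """Return a shell-safe pip install command line as a single string.

    Hypothetical helper for reproducing the install by hand; it is not
    part of the NiFi framework.
    """
    args = [
        "python3", "-m", "pip", "install", "--no-cache-dir",
        "--target", target_dir, *requirements,
    ]
    # shlex.quote wraps any argument containing shell metacharacters
    # (such as the square brackets of an extras specifier) in single quotes.
    return " ".join(shlex.quote(arg) for arg in args)


# Placeholder target directory, for illustration only.
command = build_pip_command(["pdfminer.six[image]", "numpy"], "/tmp/target")
print(command)
```

With quoting in place, the shell hands the requirement to pip unchanged, so any remaining failure would come from pip itself.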
Changing the required dependency to just *pdfminer.six* fixes the issue and I
can instantiate the processor.
However, when trying to use it against a PDF file, I get:
{code:java}
ModuleNotFoundError: No module named 'pikepdf'
ModuleNotFoundError: No module named 'pypdf'{code}
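To see up front which modules are missing, instead of hitting one ModuleNotFoundError at a time, something like the following could be run with the processor's own Python interpreter. This is only a sketch; the function name is hypothetical, and the module list is just the two names from the errors above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found on the import path."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# The two modules reported as missing when parsing a PDF.
print(missing_modules(["pikepdf", "pypdf"]))
```

An empty list means both modules are importable from that environment.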
After adding pikepdf and pypdf as dependencies, I get:
{code:java}
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is
poppler installed and in PATH? {code}
Based on
[https://pdf2image.readthedocs.io/en/latest/installation.html],
it sounds like poppler needs to be installed separately. I installed it with
brew for my local instance. This is probably worth adding to the docs if doable.
At this point I was able to use the processor to parse a PDF file.
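Since poppler is a system package that pip cannot install, a quick pre-flight check may help until the docs cover it. The sketch below assumes that pdf2image relies on the poppler command-line tools pdfinfo (page count) and pdftoppm (rendering) being on PATH; the function name is hypothetical.

```python
import shutil

# Assumption: these are the poppler executables pdf2image shells out to.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("poppler tools found")
```

Note that the check uses the PATH of the process running it, which may differ from the PATH seen by the NiFi-launched Python process.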