[ https://issues.apache.org/jira/browse/NIFI-12619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Villard updated NIFI-12619:
----------------------------------
    Description: 
When I try to instantiate the Python processor ParseDocument, I get this error:
{code:java}
2024-01-16 18:00:00,573 INFO org.apache.nifi.py4j.ExtensionManager Importing 
dependencies ['langchain', 'unstructured', 'unstructured-inference', 
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'] for 
ParseDocument to 
/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT
 using command 
['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
 '-m', 'pip', 'install', '--no-cache-dir', '--target', 
'/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
 'langchain', 'unstructured', 'unstructured-inference', 
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx']
2024-01-16 18:00:15,752 ERROR py4j.java_gateway There was an exception while 
executing the Python Proxy on the Python Side.
Traceback (most recent call last):
  File 
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py",
 line 2466, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File 
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./python/framework/Controller.py",
 line 72, in downloadDependencies
    self.extensionManager.import_external_dependencies(processor_details, 
work_dir)
  File 
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/ExtensionManager.py",
 line 511, in import_external_dependencies
    raise RuntimeError(f"Failed to import requirements for {class_name}: 
process exited with status code {result}")
RuntimeError: Failed to import requirements for ParseDocument: process exited 
with status code 
CompletedProcess(args=['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
 '-m', 'pip', 'install', '--no-cache-dir', '--target', 
'/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
 'langchain', 'unstructured', 'unstructured-inference', 
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'], returncode=1) 
{code}
When I run the pip command manually, I get:
{code:java}
no matches found: pdfminer.six[image] {code}
Changing the required dependency to just *pdfminer.six* fixes the issue and I 
can instantiate the processor.
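
The `no matches found` message is typical of zsh treating the square brackets in `pdfminer.six[image]` as a glob pattern, which would stop the manual run before pip is invoked at all. The shell used is not stated here, so this is an assumption. The logged command is an argument list, which is not glob-expanded, so the failure inside NiFi may have a different cause. A minimal sketch of building a copy-pasteable command with the requirement quoted (`shell_safe_pip_command` is a hypothetical helper, not part of NiFi):

```python
import shlex


def shell_safe_pip_command(python_exe, requirements):
    """Build a pip install command line that is safe to paste into a shell.

    shlex.quote wraps any argument containing shell-special characters
    (such as the square brackets of an extras specifier) in single quotes.
    """
    args = [python_exe, "-m", "pip", "install", *requirements]
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical usage: the extras specifier is quoted, the plain name is not.
print(shell_safe_pip_command("python3", ["pdfminer.six[image]", "pdf2image"]))
```

Running the quoted form by hand should show pip's real output for the extras requirement rather than a shell error.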

However, when I try to use it on a PDF file, I get:
{code:java}
ModuleNotFoundError: No module named 'pikepdf' 
ModuleNotFoundError: No module named 'pypdf'{code}
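
To see which of these modules are missing without running a flow, something like the following could be run with the processor's Python interpreter. It is only a sketch: `find_missing_modules` is a hypothetical helper, and the module list is taken from the two errors above.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the top-level module names the import system cannot locate."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Module names taken from the ModuleNotFoundError messages above.
print(find_missing_modules(["pikepdf", "pypdf"]))
```

An empty list means both modules can be found by that interpreter.
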
After adding the above dependencies, I get:
{code:java}
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is 
poppler installed and in PATH? {code}
Based on [https://pdf2image.readthedocs.io/en/latest/installation.html], it 
sounds like poppler needs to be installed separately. I installed it with 
brew on my local instance. -Probably worth adding this in the docs if doable.- 
This is already specified in the description of the processor.
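
pdf2image shells out to the poppler command-line tools `pdfinfo` and `pdftoppm`, so a quick check that they are on the PATH can confirm the install before retrying the processor. A minimal sketch (`missing_poppler_tools` is a hypothetical helper, and the `which` parameter exists only to make the check easy to test):

```python
import shutil

# Poppler executables that pdf2image calls to read and render PDFs.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(which=shutil.which):
    """Return the poppler tools that cannot be found on the PATH."""
    return [tool for tool in POPPLER_TOOLS if which(tool) is None]


missing = missing_poppler_tools()
if missing:
    print("Poppler tools not found on PATH:", ", ".join(missing))
else:
    print("Poppler tools found on PATH")
```

The check only reflects the PATH of the process that runs it, which may differ from the PATH seen by the NiFi Python process.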

At this point I was able to use the processor to parse a PDF file.

 

  was:
Trying to instantiate the Python processor ParseDocument, I get this error:

 
{code:java}
2024-01-16 18:00:00,573 INFO org.apache.nifi.py4j.ExtensionManager Importing 
dependencies ['langchain', 'unstructured', 'unstructured-inference', 
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'] for 
ParseDocument to 
/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT
 using command 
['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
 '-m', 'pip', 'install', '--no-cache-dir', '--target', 
'/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
 'langchain', 'unstructured', 'unstructured-inference', 
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx']
2024-01-16 18:00:15,752 ERROR py4j.java_gateway There was an exception while 
executing the Python Proxy on the Python Side.
Traceback (most recent call last):
  File 
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py",
 line 2466, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File 
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./python/framework/Controller.py",
 line 72, in downloadDependencies
    self.extensionManager.import_external_dependencies(processor_details, 
work_dir)
  File 
"/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/ExtensionManager.py",
 line 511, in import_external_dependencies
    raise RuntimeError(f"Failed to import requirements for {class_name}: 
process exited with status code {result}")
RuntimeError: Failed to import requirements for ParseDocument: process exited 
with status code 
CompletedProcess(args=['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
 '-m', 'pip', 'install', '--no-cache-dir', '--target', 
'/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
 'langchain', 'unstructured', 'unstructured-inference', 
'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'], returncode=1) 
{code}
If trying to run the pip command manually, I get

 

 
{code:java}
no matches found: pdfminer.six[image] {code}
Changing the required dependency to just *pdfminer.six* fixes the issue and I 
can instantiate the processor.

 

However when trying to use it against a PDF file, I get:

 
{code:java}
ModuleNotFoundError: No module named 'pikepdf' 
ModuleNotFoundError: No module named 'pypdf'{code}
After adding the above dependencies, I get:
{code:java}
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is 
poppler installed and in PATH? {code}
Based on

[https://pdf2image.readthedocs.io/en/latest/installation.html]

It sounds like poppler would need to be installed separately. I did it with 
brew for my local instance. Probably worth adding this in the docs if doable.

At this point I was able to use the processor to parse a PDF file.

 


> Unable to instantiate ParseDocument processor
> ---------------------------------------------
>
>                 Key: NIFI-12619
>                 URL: https://issues.apache.org/jira/browse/NIFI-12619
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 2.0.0-M1
>         Environment: Python 3.11.6
>            Reporter: Pierre Villard
>            Assignee: Pierre Villard
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I try to instantiate the Python processor ParseDocument, I get this error:
> {code:java}
> 2024-01-16 18:00:00,573 INFO org.apache.nifi.py4j.ExtensionManager Importing 
> dependencies ['langchain', 'unstructured', 'unstructured-inference', 
> 'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
> 'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'] for 
> ParseDocument to 
> /Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT
>  using command 
> ['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
>  '-m', 'pip', 'install', '--no-cache-dir', '--target', 
> '/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
>  'langchain', 'unstructured', 'unstructured-inference', 
> 'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
> 'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx']
> 2024-01-16 18:00:15,752 ERROR py4j.java_gateway There was an exception while 
> executing the Python Proxy on the Python Side.
> Traceback (most recent call last):
>   File 
> "/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/py4j/java_gateway.py",
>  line 2466, in _call_proxy
>     return_value = getattr(self.pool[obj_id], method)(*params)
>                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File 
> "/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./python/framework/Controller.py",
>  line 72, in downloadDependencies
>     self.extensionManager.import_external_dependencies(processor_details, 
> work_dir)
>   File 
> "/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/python/framework/ExtensionManager.py",
>  line 511, in import_external_dependencies
>     raise RuntimeError(f"Failed to import requirements for {class_name}: 
> process exited with status code {result}")
> RuntimeError: Failed to import requirements for ParseDocument: process exited 
> with status code 
> CompletedProcess(args=['/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT/bin/python3',
>  '-m', 'pip', 'install', '--no-cache-dir', '--target', 
> '/Users/pierre/dev/github/nifi/nifi-assembly/target/nifi-2.0.0-SNAPSHOT-bin/nifi-2.0.0-SNAPSHOT/./work/python/extensions/ParseDocument/2.0.0-SNAPSHOT',
>  'langchain', 'unstructured', 'unstructured-inference', 
> 'unstructured_pytesseract', 'numpy', 'opencv-python', 'pdf2image', 
> 'pdfminer.six[image]', 'python-docx', 'openpyxl', 'python-pptx'], 
> returncode=1) {code}
> When I run the pip command manually, I get:
> {code:java}
> no matches found: pdfminer.six[image] {code}
> Changing the required dependency to just *pdfminer.six* fixes the issue and I 
> can instantiate the processor.
> However, when I try to use it on a PDF file, I get:
> {code:java}
> ModuleNotFoundError: No module named 'pikepdf' 
> ModuleNotFoundError: No module named 'pypdf'{code}
> After adding the above dependencies, I get:
> {code:java}
> pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is 
> poppler installed and in PATH? {code}
> Based on [https://pdf2image.readthedocs.io/en/latest/installation.html], it 
> sounds like poppler needs to be installed separately. I installed it with 
> brew on my local instance. -Probably worth adding this in the docs if 
> doable.- This is already specified in the description of the processor.
> At this point I was able to use the processor to parse a PDF file.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
