Re: [I] [Task]: When running a Hop pipeline on a Spark Standalone cluster, the pipeline generates temporary files instead of the exact output file specified in the configuration. (hop)

via GitHub Wed, 05 Feb 2025 05:00:33 -0800


Raja10D commented on issue #4865:
URL: https://github.com/apache/hop/issues/4865#issuecomment-2636766242


   > I would have to check, but I think this is the intended behavior when 
running a pipeline via Beam.
   > There is no way to guarantee that only one file will be written when 
processing large volumes of data using a distributed engine. To avoid having 
multiple tasks writing to the same file (which wouldn't work for an excel file) 
we add unique identifiers.
   
   Issue Description:
   (Running on Spark)
   When I build a Hop pipeline that reads a single Excel file into a single 
Excel writer, the pipeline produces a single output file as expected. (The 
output filename differs from the configured one: if we set the name to result, 
it generates a file named 
**result_0_88cccfee-c808-47f9-922d-e8f443630f78_1.xlsv**, but the content inside 
the file is correct.)
   
   When the pipeline reads two Excel files as input to a single Excel writer, 
it generates two separate output files.
   **(The content in both files is correct, but I need it combined into one.)**
   I will insert the pipeline images:
   
   
![Image](https://github.com/user-attachments/assets/44ca4163-1d96-4a45-9295-b107e15b3d07)
   
![Image](https://github.com/user-attachments/assets/c23c8c72-0512-4eab-a806-1b1c88eb4499)
   
   **Is there a way to overcome this issue and ensure that the pipeline 
produces a single output file when multiple Excel files are used as input? Our 
primary focus is on working with Excel files.**
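
   Until there is a built-in option, one possible workaround is to collect the 
generated part files after the run and merge them in a separate step. The 
sketch below covers only the first half, locating the parts. It assumes the 
naming pattern `<base>_<number>_<uuid>_<number>.<extension>`, which is inferred 
from the single filename quoted above and is not a documented Hop guarantee. 
`find_part_files` and `PART_SUFFIX` are hypothetical names used for 
illustration.

```python
import re
from pathlib import Path

# Assumed suffix that follows the configured base name, inferred from one
# observed filename: _<number>_<uuid>_<number>.<extension>
PART_SUFFIX = r"_(\d+)_([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})_(\d+)\.\w+"


def find_part_files(directory, base_name):
    """Return the files in `directory` that look like parts of `base_name`.

    Only names that match the assumed pattern in full are returned,
    sorted by path so the order is stable between runs.
    """
    pattern = re.compile(re.escape(base_name) + PART_SUFFIX)
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and pattern.fullmatch(p.name)
    )
```

   The returned paths could then be passed to a merge step outside the 
distributed run, for example a second pipeline on the local engine; whether 
that fits depends on the data volume.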


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
