Raja10D commented on issue #4908:
URL: https://github.com/apache/hop/issues/4908#issuecomment-2659143548

   > Or in a spark context we even have the [Beam file 
input](https://hop.apache.org//manual/latest/pipeline/transforms/beamfileinput.html)
 which is a CSV reader for the distributed systems beam supports
   > 
   > it's a wrapper around the Beam CsvIO
   
   
   We are developing a custom CSV input component for Spark to eliminate the 
additional step of exporting a Beam File Definition (metadata.json) when using 
the Beam File Input transform.
   
   **Current Issue with Beam File Input on Spark**
   When we use the Beam File Input transform in Hop, the generated .hpl file 
references a Beam File Definition. To execute that pipeline on Spark, we must 
first export the metadata.json file from the Hop GUI. If metadata.json does not 
contain the Beam File Definition for our CSV input, the Spark job fails.
   
   Additionally, every time we create a new pipeline with Beam File Input, we 
must manually export the metadata.json file and ensure it is available when 
running spark-submit. This extra step adds complexity when automating pipelines 
in a distributed Spark environment.
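   For reference, the submission step we are trying to simplify looks roughly 
like the following (a sketch based on Hop's documented 
`org.apache.hop.beam.run.MainBeam` entry point; the jar name, file paths, and 
run-configuration name are placeholders):

   ```shell
   # Submit a Hop pipeline to Spark through the Beam runner.
   # The fat jar, paths, and run-configuration name below are placeholders.
   spark-submit \
     --master spark://spark-master:7077 \
     --class org.apache.hop.beam.run.MainBeam \
     /opt/hop/hop-fat.jar \
     /project/pipelines/read-csv.hpl \
     /project/metadata.json \
     Spark
   ```

   The third argument is exactly the exported metadata.json we would like to 
avoid having to ship alongside every pipeline.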
   
   **Difference Between .hpl Files**
   The .hpl file generated by CSV File Input differs from the one generated by 
Beam File Input: the latter requires a reference to a Beam File Definition, 
which is what forces us to export metadata.json manually.
   
   
   On the other hand, the Excel Input component works seamlessly on Spark 
without requiring additional metadata files.
   
   **Our Goal: CSV Input Should Work Like Excel Input on Spark**
   To simplify the process, we are building a custom CSV input component for 
Spark that behaves like the Excel Input component—allowing users to directly 
read CSV files on Spark without exporting a separate metadata.json file.
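   To illustrate the behaviour we are after, here is a minimal, hypothetical 
sketch (plain Python, not Hop/Beam code) of deriving the field definition from 
the CSV file itself—header row for names, first data row for rough types—which 
is what would let the transform skip the separately exported metadata.json:

   ```python
   import csv
   import io

   def infer_csv_fields(text):
       """Hypothetical helper: infer field names from the header row and
       rough types from the first data row, instead of requiring a
       separately exported file definition."""
       reader = csv.reader(io.StringIO(text))
       header = next(reader)
       first_row = next(reader, [])

       def guess_type(value):
           # Try the narrowest type first, fall back to String.
           for caster, type_name in ((int, "Integer"), (float, "Number")):
               try:
                   caster(value)
                   return type_name
               except ValueError:
                   pass
           return "String"

       return [{"name": name, "type": guess_type(value)}
               for name, value in zip(header, first_row)]

   sample = "id,name,score\n1,alice,3.5\n2,bob,4.0\n"
   print(infer_csv_fields(sample))
   # → [{'name': 'id', 'type': 'Integer'},
   #    {'name': 'name', 'type': 'String'},
   #    {'name': 'score', 'type': 'Number'}]
   ```

   A real implementation would of course also need the delimiter, enclosure, 
and encoding options that the Beam File Definition carries today.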
   
   Would love to hear your thoughts! Is there an internal way to achieve this, 
or does our approach make sense for a Hop-based Spark pipeline?
   