What format are you writing the file in? Are you planning to use your own
custom format, or a standard format like Parquet?

Note that Spark can write numeric data in most standard formats. If you use a
custom format instead, whoever consumes the data needs to parse it. This adds
complexity to both your code and your consumers' code. You will also need to
worry about backward compatibility.
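
For example, writing the numeric columns to Parquet on S3 is a one-liner. A
minimal sketch, assuming the s3a connector and credentials are configured on
the cluster ("s3a://my-bucket/out" is a placeholder path):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    // df is the numeric data frame from the question below; any consumer
    // with a Parquet reader can then load it without custom parsing code.
    Dataset<Row> df = ...;  // obtained elsewhere
    df.write().mode(SaveMode.Overwrite).parquet("s3a://my-bucket/out");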

I would suggest that you explore standard formats before you write custom
code. If you do have to write data in a custom format, a UDF is a good way to
serialize the data into your format.
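
A minimal sketch of such a UDF, assuming two double columns named "x" and "y"
(the column names and the big-endian layout are just illustrative choices):

    import java.nio.ByteBuffer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.api.java.UDF2;
    import org.apache.spark.sql.types.DataTypes;
    import static org.apache.spark.sql.functions.callUDF;
    import static org.apache.spark.sql.functions.col;

    // Packs two doubles into a 16-byte array; ByteBuffer's default byte
    // order is big-endian.
    UDF2<Double, Double, byte[]> pack = (x, y) ->
        ByteBuffer.allocate(2 * Double.BYTES).putDouble(x).putDouble(y).array();

    spark.udf().register("pack", pack, DataTypes.BinaryType);

    // Adds a binary column "bytes" holding each row's serialized form.
    Dataset<Row> withBytes =
        df.withColumn("bytes", callUDF("pack", col("x"), col("y")));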

On 4/8/22, 11:14 AM, "Philipp Kraus" <philipp.kraus.flashp...@gmail.com> wrote:

    Hello,

    I have a data frame with numerical data in Spark 3.1.1 (Java) which should be converted to a binary file.
    My idea is to create a UDF that generates a byte array from the numerical values, so I can apply this function to each row of the data frame and then get a new column with row-wise binary byte data.
    Once this is done, I would like to write this column as a continuous byte stream to a file stored in an S3 bucket.

    So my question is: is the idea with the UDF a good one, and is it possible to write this continuous byte stream directly to S3 / is there any built-in functionality?
    What is a good strategy to do this?
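
    A sketch of that write step, assuming the binary column "bytes" produced by a UDF as above and the s3a connector configured on the cluster ("s3a://my-bucket/data.bin" is a placeholder). Spark's DataFrameWriter has no raw-bytes output format (the "binaryFile" source is read-only), so this streams through the driver via the Hadoop FileSystem API:

        import java.io.IOException;
        import java.net.URI;
        import java.util.Iterator;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.spark.sql.Row;

        // Streams every row's "bytes" value into one S3 object through the
        // driver. toLocalIterator() fetches one partition at a time, so the
        // file never has to fit in driver memory, but all bytes do pass
        // through the driver. Row order is only well-defined if the data
        // frame is sorted first.
        // (This block belongs in a method that declares "throws IOException".)
        Path out = new Path("s3a://my-bucket/data.bin");
        FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket"),
                spark.sparkContext().hadoopConfiguration());
        try (FSDataOutputStream os = fs.create(out, true)) {
            Iterator<Row> rows = withBytes.select("bytes").toLocalIterator();
            while (rows.hasNext()) {
                os.write(rows.next().<byte[]>getAs(0));
            }
        }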

    Thanks for the help



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
