Dear Apache Spark community,
I hope this email finds you well. My name is Ruben, and I am an enthusiastic
user of Apache Spark, specifically through the Databricks platform. I am
reaching out to you today to seek your assistance and guidance regarding a
specific use case.
I have been
Yes, that binary files function looks interesting, thanks for the tip.
Some follow-up questions:
- I always wonder what people mean when they talk about 'small' files and 'large'
files. Is there any rule of thumb for when these terms apply? Are small files
those which can fit completely in memory on the node?
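For what it's worth, one rule of thumb I've seen (my own assumption, not an official Spark definition) compares a file's size to the HDFS block size, which defaults to 128 MB: files well below one block are the "small files" people usually warn about, since each one still costs NameNode metadata and per-task scheduling overhead. A minimal sketch of that heuristic:

```python
# Sketch of a common "small vs. large file" heuristic (an assumption,
# not an official Spark rule): compare file size to the HDFS block size.

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MiB, the HDFS default


def classify(size_bytes, block_size=HDFS_BLOCK_SIZE):
    """Classify a file as 'small' or 'large' relative to the block size."""
    return "small" if size_bytes < block_size else "large"


print(classify(4 * 1024 * 1024))    # a 4 MiB file  -> small
print(classify(512 * 1024 * 1024))  # a 512 MiB file -> large
```

Whether a "small" file also fits in executor memory depends on the node, so the block-size comparison is only about I/O and scheduling overhead, not about what fits in RAM.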
Hey,
We have files organized on HDFS in this manner:
base_folder
|- ID1
|  |- file1
|  |- file2
|  |- ...
|- ID2
|  |- file1
|  |- file2
|  |- ...
|- ...
We want to be able to do the following operation on our data:
- for each ID we want to
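If I'm reading the layout right, the operation is a per-ID grouping, where the ID is the name of each file's parent directory. As a local illustration (a sketch with hypothetical paths; on HDFS with Spark you would typically recover the ID from the input file path instead), the grouping looks like this:

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical HDFS-style paths matching the layout above.
paths = [
    "base_folder/ID1/file1",
    "base_folder/ID1/file2",
    "base_folder/ID2/file1",
    "base_folder/ID2/file2",
]

# Group files by the name of their parent directory, i.e. the ID.
files_by_id = defaultdict(list)
for p in paths:
    path = PurePosixPath(p)
    files_by_id[path.parent.name].append(path.name)

print(dict(files_by_id))
# {'ID1': ['file1', 'file2'], 'ID2': ['file1', 'file2']}
```

In Spark the same idea would be a groupBy on an ID column derived from the file path, so each per-ID operation sees all of that ID's files together.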