Many small files can cause technical issues in both HDFS and Spark, though
they do not generate many stages and tasks in recent versions of Spark.
// maropu
On Fri, May 20, 2016 at 2:41 PM, Gavin Yue wrote:
For log files, I would suggest saving them as gzipped text files first. After
aggregation, convert them into parquet by merging a few files.
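
For illustration, a minimal sketch of that approach (Scala, Spark 1.6-era
APIs; the paths and the target file count are assumptions, not from this
thread). It can be pasted into a spark-shell, where sc is predefined:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read the many small gzipped log files as a text DataFrame...
val logs = sqlContext.read.text("hdfs:///logs/raw/2016-05-19/*.gz")

// ...and rewrite them as a handful of larger parquet files.
// coalesce(8) merges the output into ~8 files; tune to your data volume.
logs.coalesce(8).write.parquet("hdfs:///logs/parquet/2016-05-19")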
On May 19, 2016, at 22:32, Deng Ching-Mallete wrote:
IMO, it might be better to merge or compact the parquet files instead of
keeping lots of small files in HDFS. Please refer to [1] for more info. We
also encountered the same issue with slow queries, and it was indeed caused
by the many small parquet files. In our case, we were processing
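
A minimal sketch of that kind of compaction (the paths and partition count
are assumptions; assumes a Spark 1.6-era spark-shell where sc is predefined):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read the directory full of small parquet files...
val small = sqlContext.read.parquet("hdfs:///logs/parquet/2016-05-19")

// ...and rewrite it elsewhere as fewer, larger files. Swap the
// directories (or repoint the query) once the rewrite succeeds.
small.coalesce(16).write.parquet("hdfs:///logs/parquet-compacted/2016-05-19")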
Try using the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to
control the RDD partition size. I heard that the DataFrame API can read
several files in one task.
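
For example (a sketch; the 128 MB figure and the path are illustrative):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// In spark-shell: cap the input split size at 128 MB, so no partition
// covers more than that. New-API input formats honor this key.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (128 * 1024 * 1024).toString)

val logs = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs:///logs/raw/2016-05-19")
println(logs.partitions.length) // partition count reflects the cap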
On Thu, May 19, 2016 at 8:50 PM, 王晓龙/0515 wrote:
I’m using a Spark Streaming program to store log messages into parquet files
every 10 minutes.
Now, when I query the parquet, it usually takes hundreds of thousands of
stages to compute a single count.
I looked into the parquet files’ path and found a great number of small files.
Do the small files generate so many stages?
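
For context, a minimal sketch of the kind of job being described (Scala; the
socket source, schema, and paths are assumptions). Each 10-minute batch
writes its own parquet directory, which is how the small files accumulate:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumes an existing SparkContext sc, e.g. in spark-shell
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val ssc = new StreamingContext(sc, Seconds(600))   // 10-minute batches
val lines = ssc.socketTextStream("loghost", 9999)  // hypothetical source

lines.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    // Every batch writes a fresh directory of parquet files; with only a
    // few tasks per batch, small files pile up quickly.
    rdd.map(Tuple1(_)).toDF("message")
      .write.parquet(s"hdfs:///logs/parquet/batch-${time.milliseconds}")
  }
}

ssc.start()
ssc.awaitTermination()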