Re: please help with a big Parquet file that cannot be split for reading
Hi zhangliyun,

Sorry for the late reply. From the meta file you provided (line 1650: "row group 1: RC:1403968 TS:13491534645 OFFSET:4"), there is only one RowGroup in this file (assuming the meta file is the complete output), so it is normal for this file to be read and handled by only one Spark task; the other tasks simply finish after reading the footer and performing a RowGroup range-filter check.

You can consider controlling `parquet.block.size` when writing the Parquet file so that it has multiple RowGroups, which lets it be read and handled in parallel by multiple Spark tasks (see the sketch after this message).

As for why 80 tasks were started to read the file, you can study the logic of `org.apache.spark.sql.execution.datasources.FilePartition#maxSplitBytes`: `maxSplitBytes` is not determined by `spark.sql.files.maxPartitionBytes` alone.

Yang Jie

From: zhangliyun
Date: Friday, March 24, 2023, 09:09
To: Alfie Davidson
Cc: yangjie01, Spark Dev List
Subject: Re: Re: please help with a big Parquet file that cannot be split for reading

@Yangjie: the meta file is attached. I used

```
hadoop jar parquet-tools-1.11.2.jar meta hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
```

to get the info; not sure this is what you mentioned. I did not find row group info in the meta file. If my command is wrong, please tell me.

@Alfie Davidson: if it is like that, everything is reasonable.
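A minimal sketch of Yang Jie's `parquet.block.size` suggestion, in Scala. The input/output paths and the 128 MB target are illustrative assumptions, not values from the thread:

```scala
import org.apache.spark.sql.SparkSession

// Hedged sketch: rewrite the data with a smaller RowGroup size so one large
// file contains many RowGroups that Spark tasks can read in parallel.
// Paths and the 128 MB target below are placeholders.
val spark = SparkSession.builder().appName("rewrite-parquet").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .setInt("parquet.block.size", 128 * 1024 * 1024) // target bytes per RowGroup

val df = spark.read.parquet("hdfs://horton/user/yazou/VenInv/input.parquet")
df.write.parquet("hdfs://horton/user/yazou/VenInv/output.parquet")
```

And a rough paraphrase of the `FilePartition#maxSplitBytes` logic Yang Jie refers to (based on Spark 3.x; exact behavior varies by version, so treat this as a sketch rather than the verbatim source). It shows why the split size, and hence the task count, is not governed by `spark.sql.files.maxPartitionBytes` alone:

```scala
// Rough paraphrase of FilePartition#maxSplitBytes (Spark 3.x): the effective
// split size also depends on the per-file open cost and the parallelism.
def maxSplitBytes(
    filesMaxPartitionBytes: Long, // spark.sql.files.maxPartitionBytes
    filesOpenCostInBytes: Long,   // spark.sql.files.openCostInBytes
    minPartitionNum: Long,        // defaults to the session's parallelism
    fileSizes: Seq[Long]): Long = {
  val totalBytes   = fileSizes.map(_ + filesOpenCostInBytes).sum
  val bytesPerCore = totalBytes / minPartitionNum
  math.min(filesMaxPartitionBytes, math.max(filesOpenCostInBytes, bytesPerCore))
}

// Illustrative numbers (assumed, not from the thread): one ~1.9 GB file,
// defaults of 128 MB / 4 MB, and a parallelism of 80 give ~24 MB splits,
// i.e. about 80 tasks -- even though only one of them has a RowGroup to decode.
maxSplitBytes(128L << 20, 4L << 20, 80L, Seq(1900L << 20))
```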
Re: please help with a big Parquet file that cannot be split for reading
I'm pretty sure a snappy file is not splittable. That's why you have a single task (and most likely a single core) reading the 1.9 GB snappy file.

Sent from my iPhone

> On 23 Mar 2023, at 07:36, yangjie01 wrote:
Re: please help with a big Parquet file that cannot be split for reading
Is there only one RowGroup in this file? You can check this by printing the file's metadata using the `meta` command of `parquet-cli`.

Yang Jie

From: zhangliyun
Date: Thursday, March 23, 2023, 15:16
To: Spark Dev List
Subject: please help with a big Parquet file that cannot be split for reading

Hi all,

I want to ask a question about how to split a big Parquet file when Spark reads it. I have a Parquet file which is 1.9 GB. I have set spark.sql.files.maxPartitionBytes=12800; it starts 80 tasks (80 * 128M ~ 1.9G), but the partitions are not even: one partition reads 1.9 GB of data while the others read only 3 MB (see attached pic). I have checked the compression codec of the file; it is snappy, which can be splittable.

```
hadoop jar parquet-tools-1.11.2.jar head hdfs://horton/user/yazou/VenInv/shifu_norm_emb_bert/emb_valid_sel_train_sam_1ep.parquet
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 368273 records.
23/03/22 20:41:56 INFO hadoop.InternalParquetRecordReader: at row 0. reading next block
23/03/22 20:42:09 INFO compress.CodecPool: Got brand-new decompressor [.snappy]
23/03/22 20:42:09 INFO hadoop.InternalParquetRecordReader: block read in memory in 13227 ms. row count = 368273
```

The Spark code is like:

```
spark.read
  .format("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
  .option("mergeSchema", "false")
  .load("")
```

Appreciate your help.

Best Regards,
Kelly Zhang
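As a programmatic alternative to `parquet-cli`, a minimal Scala sketch that counts RowGroups via the `parquet-hadoop` API. This assumes the `org.apache.parquet:parquet-hadoop` dependency on the classpath; the object name and file path are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Hedged sketch: count the RowGroups in a Parquet file. A file with a single
// RowGroup can only be decoded by one Spark task, whatever the split
// configuration says. The path passed in args(0) is a placeholder.
object RowGroupCount {
  def main(args: Array[String]): Unit = {
    val input  = HadoopInputFile.fromPath(new Path(args(0)), new Configuration())
    val reader = ParquetFileReader.open(input)
    try {
      val blocks = reader.getFooter.getBlocks // one BlockMetaData per RowGroup
      println(s"row groups: ${blocks.size()}")
    } finally reader.close()
  }
}
```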