[GitHub] [arrow] yurikoomiga commented on issue #13186: An Error Occured While Reading Parquet File Using C++ - GetRecordBatchReader -Corrupt snappy compressed data.

GitBox Tue, 24 May 2022 03:34:32 -0700


yurikoomiga commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1135743977


   > @yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)
   
   Sorry for the late reply.
   The sample file is too large to upload, so here is the code that generates it:
   
   ```
   import random, string
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   def create_list(column_type):
       # Return one random cell value for the given column type.
       if column_type == "VARCHAR":
           num = string.ascii_letters + string.digits
           return "".join(random.sample(num, random.randint(1, 20)))
       elif column_type == "INT":
           return random.randint(1, 65536)
   
   def chang_column_type(column_type,data_frame_column):
       if "INT" in column_type:
           return  data_frame_column.astype("int32")
       return data_frame_column
   
   def build_parquet_schema(column_name,column_type):
       table_list = list()
       for index, column in enumerate(column_name):
           if "VARCHAR" in column_type[index]:
               table_list.append((column, pa.string()))
           elif "INT" in column_type[index]:
               table_list.append((column,pa.int32()))
           else:
               table_list.append((column, pa.string()))
       return  pa.schema(table_list)
   
   if __name__ == '__main__':
   
       parquet_file ="test.parquet"
   
       column_type,column_name,data_list = list(),list(),list()
       for i in range(0,20):
           column_name.append("TEST%s"%i)
           column_type.append("VARCHAR" if i % 2 == 0 else "INT")
   
       table_schema = build_parquet_schema(column_name,column_type)
   
       for i in range(0,3*1000*1000):
           data_list.append(list(map(create_list,column_type)))
   
       test_panda_frame = pd.DataFrame(data_list, columns=tuple(column_name))
       for index, column in enumerate(column_name):
           test_panda_frame[column] = chang_column_type(column_type[index], test_panda_frame[column])
       table = pa.Table.from_pandas(test_panda_frame, schema=table_schema)
       pq.write_table(table, parquet_file,row_group_size=300*1000*1000)
       exit(0)
   ```
   I ran it on Ubuntu 9.4.0 with Python 3.8 and pyarrow 7.0.0.
   You can use this script to generate a test.parquet file, then read any combination of multiple columns with `_reader->set_use_threads(true);` to reproduce the error.
   @pitrou @westonpace 
   


