yurikoomiga commented on issue #13186: URL: https://github.com/apache/arrow/issues/13186#issuecomment-1135743977
> @yurikoomiga Can you post a sample file that fails somewhere? (or code to reproduce the generation of the file)

Sorry for the late reply. The sample file is very large, so I am posting the code that generates it instead:

```
import random
import string

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def create_list(type):
    # Return one random cell value for the given column type.
    if type == "VARCHAR":
        num = string.ascii_letters + string.digits
        return "".join(random.sample(num, random.randint(1, 20)))
    elif type == "INT":
        return random.randint(1, 65536)


def chang_column_type(column_type, data_frame_column):
    # Cast INT columns to int32 so they match the Parquet schema.
    if "INT" in column_type:
        return data_frame_column.astype("int32")
    return data_frame_column


def build_parquet_schema(column_name, column_type):
    # Map each column to an Arrow type; anything that is not INT becomes a string.
    table_list = list()
    for index, column in enumerate(column_name):
        if "VARCHAR" in column_type[index]:
            table_list.append((column, pa.string()))
        elif "INT" in column_type[index]:
            table_list.append((column, pa.int32()))
        else:
            table_list.append((column, pa.string()))
    return pa.schema(table_list)


if __name__ == '__main__':
    parquet_file = "test.parquet"
    column_type, column_name, data_list = list(), list(), list()
    # 20 columns, alternating VARCHAR (even index) and INT (odd index).
    for i in range(0, 20):
        column_name.append("TEST%s" % i)
        column_type.append("VARCHAR") if i % 2 == 0 else column_type.append("INT")
    table_schema = build_parquet_schema(column_name, column_type)
    # 3 million rows of random data.
    for i in range(0, 3 * 1000 * 1000):
        data_list.append(list(map(create_list, column_type)))
    test_panda_frame = pd.DataFrame(data_list, columns=tuple(column_name))
    for index, column in enumerate(column_name):
        test_panda_frame[column] = chang_column_type(column_type[index], test_panda_frame[column])
    table = pa.Table.from_pandas(test_panda_frame, schema=table_schema)
    # row_group_size exceeds the row count, so all rows go into a single row group.
    pq.write_table(table, parquet_file, row_group_size=300 * 1000 * 1000)
    exit(0)
```

I ran it on ubuntu 9.4.0 with Python 3.8 and pyarrow 7.0.0.

You can use this to generate a `test.parquet` file and then read any set of multiple columns from it with `_reader->set_use_threads(true);`.

@pitrou @westonpace

--
This is an automated message from the Apache Git Service.