yurikoomiga commented on issue #13186:
URL: https://github.com/apache/arrow/issues/13186#issuecomment-1135743977
> @yurikoomiga Can you post a sample file that fails somewhere? (or code to
> reproduce the generation of the file)
Sorry for the late reply.
The sample file is too large to upload, so here is the code that generates it:
```
import random, string
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def create_list(type):
    # Return one random cell value for the given column type.
    if type == "VARCHAR":
        num = string.ascii_letters + string.digits
        return "".join(random.sample(num, random.randint(1, 20)))
    elif type == "INT":
        return random.randint(1, 65536)


def chang_column_type(column_type, data_frame_column):
    # Cast INT columns to int32 so they match the Parquet schema.
    if "INT" in column_type:
        return data_frame_column.astype("int32")
    return data_frame_column


def build_parquet_schema(column_name, column_type):
    # Map each column to an Arrow type; anything that is not INT is a string.
    table_list = list()
    for index, column in enumerate(column_name):
        if "VARCHAR" in column_type[index]:
            table_list.append((column, pa.string()))
        elif "INT" in column_type[index]:
            table_list.append((column, pa.int32()))
        else:
            table_list.append((column, pa.string()))
    return pa.schema(table_list)


if __name__ == '__main__':
    parquet_file = "test.parquet"
    column_type, column_name, data_list = list(), list(), list()
    # 20 columns, alternating VARCHAR (even index) and INT (odd index).
    for i in range(0, 20):
        column_name.append("TEST%s" % i)
        column_type.append("VARCHAR" if i % 2 == 0 else "INT")
    table_schema = build_parquet_schema(column_name, column_type)
    # 3 million rows of random data.
    for i in range(0, 3*1000*1000):
        data_list.append(list(map(create_list, column_type)))
    test_panda_frame = pd.DataFrame(data_list, columns=tuple(column_name))
    for index, column in enumerate(column_name):
        test_panda_frame[column] = chang_column_type(column_type[index],
                                                     test_panda_frame[column])
    table = pa.Table.from_pandas(test_panda_frame, schema=table_schema)
    # row_group_size exceeds the row count, so all rows land in one row group.
    pq.write_table(table, parquet_file, row_group_size=300*1000*1000)
    exit(0)
```
I ran it on ubuntu 9.4.0 with Python 3.8 and pyarrow 7.0.0.
You can use this to generate a test.parquet file, then read multiple
columns from it with `_reader->set_use_threads(true);` set.
@pitrou @westonpace