[GitHub] [arrow] westonpace commented on issue #33188: [Parquet][C++][Python] "List index overflow" when read parquet file

via GitHub Thu, 29 Jun 2023 09:00:36 -0700


westonpace commented on issue #33188:
URL: https://github.com/apache/arrow/issues/33188#issuecomment-1613460257


   > is there a pyarrow API for that?
   
   Are you creating these tables in Python? You could cast the list columns to `large_list`:
   
   ```
   import pandas as pd
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.parquet as pq
   import numpy as np
   
   big_arr = np.zeros(1024*1024*1024, dtype=np.int8)
   straw_that_broke_the_camels_back = np.zeros(1, dtype=np.int8)
   big_series = pd.Series([big_arr, big_arr, straw_that_broke_the_camels_back])
   big_df = pd.DataFrame({"big": big_series})
   
   # Will not be readable by pyarrow
   big_df.to_parquet("/tmp/unreadable.parquet")
   
   big_table = pa.Table.from_pandas(big_df)
   new_columns = []
   for column in big_table.columns:
       if isinstance(column.type, pa.ListType):
           new_columns.append(pc.cast(column, pa.large_list(column.type.value_type)))
       else:
           new_columns.append(column)
   
   new_table = pa.Table.from_arrays(new_columns, names=big_table.schema.names)
   # This will contain the same data but use large_list and thus will be readable
   pq.write_table(new_table, "/tmp/readable.parquet")
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
