[I] [Python] How to delete one of the duplicated __index_level_0__? [arrow]

via GitHub Wed, 11 Sep 2024 00:16:25 -0700


Forshining opened a new issue, #44059:
URL: https://github.com/apache/arrow/issues/44059


   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   My question description is as follows:
   
   I have a Parquet file that already contains the `__index_level_0__` column. I am trying to delete the rows that do not meet a criterion and write the updated dataframe to a new Parquet file. The code is as follows:
   
   ```python 
   import pyarrow.parquet as pq 
   import pyarrow as pa 
   import pandas as pd
   import os 
   
   ## select the images with aesthetic score >= 5
   for m in range(100,128):
   
       # table_temp = pq.ParquetFile('part-00000-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet')
       # column_names = table_temp.schema.names
       # print("Column names:", column_names)
   
       read_file = f'part-00{m}-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet'
   
       table_temp_pandas = pq.read_table(read_file).to_pandas()
   
       index_to_be_deleted = []
   
       for i in range(table_temp_pandas.shape[0]):
           if table_temp_pandas.loc[i, 'AESTHETIC_SCORE'] < 5:
               index_to_be_deleted.append(i)
   
       table_dropped = table_temp_pandas.drop(table_temp_pandas.index[index_to_be_deleted])
   
       split_string = read_file.split("-")
   
       write_file = "new-part-" + split_string[1] + ".parquet"
   
       table_dropped_p = pa.Table.from_pandas(table_dropped, preserve_index=False)
   
       print(f"Writing new table into parquet......:{m+1}/127")
       pq.write_table(table_dropped_p, write_file)
       print("Completed")
   
       print(f"Starting to remove the former parquet file: part-00{m}-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet")
       os.remove(f"part-00{m}-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet")
       print("File removed successfully!")
   ```
   
   However, when I tried to read the new Parquet files with the following code:
   
   ```python 
   read_file = "./datasets--laion--aesthetics_v2_4.75/new-part-00000.parquet"
   table_temp_pandas = pq.read_table(read_file).to_pandas()
   ```
   
   I got this error:
   
   ```bash 
   pyarrow.lib.ArrowInvalid: Multiple matches for FieldRef.Name(__index_level_0__)
   ```
   
   After checking the new dataframe, I found that it has two columns with the same name, `__index_level_0__`: one from the original Parquet file and another from the `pq.write_table()` operation.
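
For reference, a minimal sketch of how the duplicated names could be listed from the schema alone, without loading the data. The helper names `duplicated_names` and `duplicated_columns_in_file` are hypothetical, and it is an assumption that `pq.read_schema` succeeds on a file where `pq.read_table` fails.

```python
from collections import Counter


def duplicated_names(names):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


def duplicated_columns_in_file(path):
    """List duplicated column names in a Parquet file (hypothetical helper).

    Assumption: pq.read_schema only reads the file footer, so it should
    not hit the "Multiple matches" error raised when reading the data.
    """
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    return duplicated_names(pq.read_schema(path).names)
```

Calling `duplicated_columns_in_file` on one of the new files should report `__index_level_0__` if the column really is stored twice.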
   
   However, I do not have enough storage to keep the original and the new Parquet files at the same time, and the original files have already been removed. Is there any way to delete one of the `__index_level_0__` columns using only the new Parquet files?
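
One approach that might work is to drop the extra column by position rather than by name, one file at a time. This is only a sketch, not a confirmed fix: the helper names are hypothetical, it is an assumption that `pq.ParquetFile(path).read()` can load a file with duplicated column names, and it is an assumption that the first `__index_level_0__` column is the one worth keeping.

```python
import os


def positions_to_drop(names, target="__index_level_0__"):
    """Indices of every occurrence of `target` after the first one."""
    hits = [i for i, name in enumerate(names) if name == target]
    return hits[1:]


def drop_duplicate_column(path, target="__index_level_0__"):
    """Rewrite `path` keeping only the first `target` column (hypothetical helper)."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    # Assumption: ParquetFile.read() does not resolve columns by name,
    # so it may avoid the "Multiple matches" error from pq.read_table().
    table = pq.ParquetFile(path).read()

    # Remove from the highest index down so earlier positions stay valid.
    for i in reversed(positions_to_drop(table.column_names, target)):
        table = table.remove_column(i)

    # Write to a temporary file, then swap it in, so only one extra
    # file exists on disk at any moment.
    tmp_path = path + ".tmp"
    pq.write_table(table, tmp_path)
    os.replace(tmp_path, path)
```

Because each file is rewritten and swapped individually, the extra disk space needed at any moment should be about the size of a single file.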
   
   I would appreciate anyone's precious time and help!
   
   ### Component(s)
   
   Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
