Forshining opened a new issue, #44059:
URL: https://github.com/apache/arrow/issues/44059
### Describe the usage question you have. Please include as many useful details as possible.
My question is as follows:
I have a Parquet file that already contains the `__index_level_0__`
column. I am trying to delete the rows that do not meet a criterion and
write the updated dataframe to a new Parquet file. The code is as follows:
```python
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import os

# Select the images with aesthetic score >= 5
for m in range(100, 128):
    # table_temp = pq.ParquetFile('part-00000-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet')
    # column_names = table_temp.schema.names
    # print("Column names:", column_names)
    read_file = f'part-00{m}-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet'
    table_temp_pandas = pq.read_table(read_file).to_pandas()

    # Collect the row labels whose score is below the threshold
    index_to_be_deleted = []
    for i in range(table_temp_pandas.shape[0]):
        if table_temp_pandas.loc[i, 'AESTHETIC_SCORE'] < 5:
            index_to_be_deleted.append(i)
    table_dropped = table_temp_pandas.drop(table_temp_pandas.index[index_to_be_deleted])

    # Build the output name from the part number, e.g. "new-part-00100.parquet"
    split_string = read_file.split("-")
    write_file = "new-part-" + split_string[1] + ".parquet"
    table_dropped_p = pa.Table.from_pandas(table_dropped, preserve_index=False)
    print(f"Writing new table into parquet......:{m+1}/127")
    pq.write_table(table_dropped_p, write_file)
    print("Completed")

    # Delete the original file to free up storage
    print(f"Starting to remove the former parquet file: part-00{m}-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet")
    os.remove(f"part-00{m}-4e217ab5-40f3-4738-ac05-c1cb9f75ef32-c000.snappy.parquet")
    print("File removed successfully!")
```
However, when I tried to read the new Parquet files with the following code:
```python
read_file = "./datasets--laion--aesthetics_v2_4.75/new-part-00000.parquet"
table_temp_pandas = pq.read_table(read_file).to_pandas()
```
I got the following error:
```bash
pyarrow.lib.ArrowInvalid: Multiple matches for FieldRef.Name(__index_level_0__)
```
After checking the new dataframe, I found that there are two columns named
`__index_level_0__`: one from the original Parquet file and another from the
`pq.write_table()` call.
However, I do not have enough storage to keep the original and new Parquet
files at the same time. Is there a way to delete one of the
`__index_level_0__` columns directly, using only the new Parquet files?
I would appreciate anyone's time and help!
### Component(s)
Parquet, Python