cheolgook opened a new issue, #37509:
URL: https://github.com/apache/arrow/issues/37509

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I have a Python 3 script that reads JSON log files and stores them as 
Parquet using a pyarrow schema. I recently ran into a strange case that fails 
with the error in the title, `ArrowTypeError: "object of type <class 'str'> 
cannot be converted to int"`.
   
   Here's the log file:
   ```json, pa_bad.log
   
{"uuid":"D4D8806A-4794-11EE-BC2C-31A44E74A5E9","timestamp":"2023-08-30T00:21:12+0000","version":"0.1","event_data":{"amount":27.1048,"user":19323160,"timestamp":"2023-08-30T00:21:12+0000"}}
   ```
   
   Here's my script:
   ```python3, pa.py
   #!/usr/bin/python3
   
   import pandas as pd
   import pyarrow as pa
   
   SCHEMA = pa.schema([
       ('uuid', pa.string()),
       ('timestamp', pa.timestamp('s', tz='UTC')),
       ('version', pa.float64()),
       ('event_data', pa.struct([
           ('amount', pa.float64()),
           ('user', pa.int64()),
           ('timestamp', pa.timestamp('s', tz='UTC')),
       ])),
   ])
   
   
   def main():
       df = pd.read_json('pa_bad.log', lines=True)
       df.to_parquet(path='pa.parquet', schema=SCHEMA)
   
   
   if __name__ == '__main__':
       main()
   ```
   
   and the error looks like this:
   ```shell
   $ python3 pa.py
   Traceback (most recent call last):
     File "pa.py", line 24, in <module>
       main()
     File "pa.py", line 20, in main
       df.to_parquet(path='pa.parquet', schema=SCHEMA)
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/util/_decorators.py", line 199, in wrapper
       return func(*args, **kwargs)
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/core/frame.py", line 2463, in to_parquet
       **kwargs,
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/io/parquet.py", line 397, in to_parquet
       **kwargs,
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pandas/io/parquet.py", line 152, in write
       table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
     File "pyarrow/table.pxi", line 3681, in pyarrow.lib.Table.from_pandas
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 612, in dataframe_to_arrays
       for c, f in zip(columns_to_convert, convert_fields)]
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 612, in <listcomp>
       for c, f in zip(columns_to_convert, convert_fields)]
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 598, in convert_column
       raise e
     File "/home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 592, in convert_column
       result = pa.array(col, type=type_, from_pandas=True, safe=safe)
     File "pyarrow/array.pxi", line 323, in pyarrow.lib.array
     File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
     File "pyarrow/error.pxi", line 123, in pyarrow.lib.check_status
   pyarrow.lib.ArrowTypeError: ("object of type <class 'str'> cannot be converted to int", 'Conversion failed for column event_data with type object')
   ```
   
   While digging into the JSON file, I found that the script works if I remove 
`event_data.timestamp` from the log file:
   ```json, pa_good.log
   
{"uuid":"D4D8806A-4794-11EE-BC2C-31A44E74A5E9","timestamp":"2023-08-30T00:21:12+0000","version":"0.1","event_data":{"amount":27.1048,"user":19323160}}
   ```
   Please note that the schema entry for `event_data.timestamp` in `pa.py` 
was removed as well.
   
   I tried these variations of the JSON input, and they all worked without problems.
   ```json
   {
     "uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9",
     "timestamp": "2023-08-30T00:21:12+0000",
     "version": "0.1",
     "event_data": {
       "amount": 27.1048,
       "user": 19323160,
       "uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9"  <-------- works
     }
   }
   ```
   ```json
   {
     "uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9",
     "timestamp": "2023-08-30T00:21:12+0000",
     "version": "0.1",
     "event_data": {
       "amount": 27.1048,
       "user": 19323160,
       "version": "0.1"  <-------- works
     }
   }
   ```
   
   The problem happens only when there is a timestamp value in a nested object:
   ```json
   {
     "uuid": "D4D8806A-4794-11EE-BC2C-31A44E74A5E9",
     "version": "0.1",
     "event": {
       "amount": 27.1048,
       "user": 19323160,
       "ts": "2023-08-30T00:21:12+0000" <----------- DOES NOT WORK
     }
   }
   ```
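
A possible explanation, offered as an assumption rather than a confirmed diagnosis: `pd.read_json` converts date-like top-level columns such as `timestamp` to datetimes by default, but values inside a nested object stay plain strings, and the string then reaches a schema field typed as a timestamp. If that is the cause, parsing the nested value before writing might avoid the error. The sketch below uses only the standard library; `parse_nested_timestamp`, `key`, and `TS_FORMAT` are hypothetical names introduced for illustration.

```python
from datetime import datetime, timezone

# Format of the timestamps in the log lines above, e.g. "2023-08-30T00:21:12+0000".
TS_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_nested_timestamp(record, key="timestamp"):
    """Return a copy of `record` with `record[key]` parsed into an aware datetime.

    Hypothetical helper for this sketch. Values that are missing or are not
    strings are left untouched, and the input dict is not modified.
    """
    out = dict(record)
    value = out.get(key)
    if isinstance(value, str):
        # Parse the offset-carrying string and normalize it to UTC.
        out[key] = datetime.strptime(value, TS_FORMAT).astimezone(timezone.utc)
    return out


event_data = {
    "amount": 27.1048,
    "user": 19323160,
    "timestamp": "2023-08-30T00:21:12+0000",
}
converted = parse_nested_timestamp(event_data)
```

In `pa.py` this could be applied with something like `df['event_data'] = df['event_data'].map(parse_nested_timestamp)` before the `df.to_parquet(...)` call. Whether that resolves the error on the versions listed below has not been verified here.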
   
   I tested on these systems:
   ```shell
   # macOS
   
   $ pyenv local
   3.11.1
   
   $ pip show pandas pyarrow
   Name: pandas
   Version: 1.5.1
   Summary: Powerful data structures for data analysis, time series, and statistics
   Home-page: https://pandas.pydata.org
   Author: The Pandas Development Team
   Author-email: [email protected]
   License: BSD-3-Clause
   Location: /Users/user/.pyenv/versions/3.11.1/lib/python3.11/site-packages
   Requires: numpy, python-dateutil, pytz
   Required-by: parquet-tools
   ---
   Name: pyarrow
   Version: 13.0.0
   Summary: Python library for Apache Arrow
   Home-page: https://arrow.apache.org/
   Author:
   Author-email:
   License: Apache License, Version 2.0
   Location: /Users/user/.pyenv/versions/3.11.1/lib/python3.11/site-packages
   Requires: numpy
   Required-by: parquet-tools
   ```
   
   ```shell
   # CentOS 7
   
   $ pyenv local
   3.7.1
   
   $ pip show pandas pyarrow
   Name: pandas
   Version: 1.2.0
   Summary: Powerful data structures for data analysis, time series, and statistics
   Home-page: https://pandas.pydata.org
   Author: None
   Author-email: None
   License: BSD
   Location: /home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages
   Requires: python-dateutil, pytz, numpy
   Required-by:
   ---
   Name: pyarrow
   Version: 12.0.1
   Summary: Python library for Apache Arrow
   Home-page: https://arrow.apache.org/
   Author: None
   Author-email: None
   License: Apache License, Version 2.0
   Location: /home/user/.pyenv/versions/3.7.1/lib/python3.7/site-packages
   Requires: numpy
   Required-by: 
   ```
   
   ### Component(s)
   
   Parquet, Python

