marklit opened a new issue, #3433:
URL: https://github.com/apache/arrow-rs/issues/3433

   Versions:
   
   * json2parquet 0.6.0 with the following Cargo packages:
     * parquet = "29.0.0"
     * arrow = "29.0.0"
     * arrow-schema = { version = "29.0.0", features = ["serde"] }
   * PyArrow 10.0.1
   * ClickHouse 22.13.1.1119
   
   ```bash
   $ vi test.jsonl
   ```
   
   ```json
   {"area": 123, "geom": "", "centroid_x": -86.86346599122807, "centroid_y": 34.751296108771925, "h3_7": "872649315ffffff", "h3_8": "882649315dfffff", "h3_9": "892649315cbffff"}
   ```
   
   ```bash
   $ json2parquet -c lz4 test.jsonl lz4.pq
   $ ls -lth lz4.pq # 2.5K
   ```
   
   ```
   $ hexdump -C lz4.pq | head; echo; hexdump -C lz4.pq | tail
   ```
   
   ```
   00000000  50 41 52 31 15 00 15 1c  15 42 2c 15 02 15 00 15  |PAR1.....B,.....|
   00000010  06 15 06 1c 58 08 7b 00  00 00 00 00 00 00 18 08  |....X.{.........|
   00000020  7b 00 00 00 00 00 00 00  00 00 00 04 22 4d 18 44  |{..........."M.D|
   00000030  40 5e 0e 00 00 80 02 00  00 00 02 01 7b 00 00 00  |@^..........{...|
   00000040  00 00 00 00 00 00 00 00  e4 c0 1d d2 15 04 19 25  |...............%|
   00000050  00 06 19 18 04 61 72 65  61 15 0a 16 02 16 6a 16  |.....area.....j.|
   00000060  90 01 26 08 3c 58 08 7b  00 00 00 00 00 00 00 18  |..&.<X.{........|
   00000070  08 7b 00 00 00 00 00 00  00 00 00 15 00 15 1c 15  |.{..............|
   00000080  42 2c 15 02 15 00 15 06  15 06 1c 58 08 19 62 dc  |B,.........X..b.|
   00000090  06 43 b7 55 c0 18 08 19  62 dc 06 43 b7 55 c0 00  |.C.U....b..C.U..|
   
   00000970  41 42 51 41 45 41 41 4f  41 41 38 41 42 41 41 41  |ABQAEAAOAA8ABAAA|
   00000980  41 41 67 41 45 41 41 41  41 42 67 41 41 41 41 67  |AAgAEAAAABgAAAAg|
   00000990  41 41 41 41 41 41 41 42  41 68 77 41 41 41 41 49  |AAAAAAABAhwAAAAI|
   000009a0  41 41 77 41 42 41 41 4c  41 41 67 41 41 41 42 41  |AAwABAALAAgAAABA|
   000009b0  41 41 41 41 41 41 41 41  41 51 41 41 41 41 41 45  |AAAAAAAAAQAAAAAE|
   000009c0  41 41 41 41 59 58 4a 6c  59 51 41 41 41 41 41 3d  |AAAAYXJlYQAAAAA=|
   000009d0  00 18 19 70 61 72 71 75  65 74 2d 72 73 20 76 65  |...parquet-rs ve|
   000009e0  72 73 69 6f 6e 20 32 33  2e 30 2e 30 00 fe 04 00  |rsion 23.0.0....|
   000009f0  00 50 41 52 31                                    |.PAR1|
   000009f5
   ```
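   If I am reading the dump right, the four bytes `04 22 4d 18` near the start of the first data page (offset 0x2b) decode, little-endian, to 0x184D2204, which is the LZ4 *frame* format magic number. Readers based on Arrow C++ expect raw LZ4 blocks (or the legacy Hadoop framing) for the `LZ4` codec, which would explain the "Corrupt Lz4 compressed data" errors. A quick stdlib-only sanity check of that magic:
   
   ```python
   # Check: the bytes at offset 0x2b of lz4.pq (copied from the hexdump
   # above) are the LZ4 frame format magic number, stored little-endian.
   LZ4_FRAME_MAGIC = 0x184D2204  # per the LZ4 frame format specification
   
   page_prefix = bytes.fromhex("04224d18")  # bytes 0x2b..0x2e in the dump
   assert int.from_bytes(page_prefix, "little") == LZ4_FRAME_MAGIC
   print("first data page starts with an LZ4 frame header")
   ```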
   
   ```bash
   $ ipython
   ```
   
   ```python
   In [1]: import pyarrow.parquet as pq
   
   In [2]: pf = pq.ParquetFile('lz4.pq')
   
   In [3]: pf
   Out[3]: <pyarrow.parquet.core.ParquetFile at 0x10ca1cd90>
   
   In [4]: pf.schema
   Out[4]:
   <pyarrow._parquet.ParquetSchema object at 0x10e74b280>
   required group field_id=-1 arrow_schema {
     optional int64 field_id=-1 area;
     optional double field_id=-1 centroid_x;
     optional double field_id=-1 centroid_y;
     optional binary field_id=-1 geom (String);
     optional binary field_id=-1 h3_7 (String);
     optional binary field_id=-1 h3_8 (String);
     optional binary field_id=-1 h3_9 (String);
   }
   
   In [6]: pf.read()
   
   # OSError: Corrupt Lz4 compressed data.
   ```
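   For comparison, a minimal sketch (assuming PyArrow >= 10 and a writable working directory; `pyarrow_lz4.pq` is a throwaway name): a file that PyArrow itself writes with `compression="lz4"` round-trips cleanly, so the incompatibility looks like it is on the writer side rather than in the reader's LZ4 support:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Same single record as test.jsonl
   table = pa.table({
       "area": pa.array([123], type=pa.int64()),
       "geom": [""],
       "centroid_x": [-86.86346599122807],
       "centroid_y": [34.751296108771925],
       "h3_7": ["872649315ffffff"],
       "h3_8": ["882649315dfffff"],
       "h3_9": ["892649315cbffff"],
   })
   
   # PyArrow's "lz4" maps to the newer LZ4_RAW Parquet codec, if I
   # understand its codec handling correctly
   pq.write_table(table, "pyarrow_lz4.pq", compression="lz4")
   
   # Reads back without the OSError seen above
   assert pq.read_table("pyarrow_lz4.pq").equals(table)
   ```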
   
   ```bash
   $ clickhouse client
   ```
   
   ```sql
   CREATE TABLE pq_test (
       area Nullable(Int64),
       centroid_x Nullable(Float64),
       centroid_y Nullable(Float64),
       geom Nullable(String),
       h3_7 Nullable(String),
       h3_8 Nullable(String),
       h3_9 Nullable(String))
    ENGINE = Log;
   ```
   
   ```bash
   $ clickhouse client \
       --query='INSERT INTO pq_test FORMAT Parquet' \
       < lz4.pq
   ```
   
   ```
   Code: 33. DB::ParsingException: Error while reading Parquet data: IOError: Corrupt Lz4 compressed data.: While executing ParquetBlockInputFormat: data for INSERT was parsed from stdin: (in query: INSERT INTO pq_test FORMAT Parquet). (CANNOT_READ_ALL_DATA)
   ```
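   One more data point that may help triage (a sketch, assuming PyArrow is available; `meta_check.pq` is a throwaway name): the Parquet footer records which codec each column chunk declares, and footer parsing does not touch the data pages, so the same calls should also work on the corrupt `lz4.pq` (just as `pf.schema` did above). This shows whether a chunk claims the deprecated `LZ4` codec or the newer `LZ4_RAW`:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Write a small LZ4-compressed file so the snippet is self-contained;
   # point ParquetFile at lz4.pq instead to inspect the file from this report.
   pq.write_table(pa.table({"area": [123]}), "meta_check.pq", compression="lz4")
   
   md = pq.ParquetFile("meta_check.pq").metadata
   for rg in range(md.num_row_groups):
       for col in range(md.num_columns):
           cc = md.row_group(rg).column(col)
           # The codec name distinguishes LZ4 from LZ4_RAW per chunk
           print(cc.path_in_schema, cc.compression)
   ```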

