marklit opened a new issue, #3433:
URL: https://github.com/apache/arrow-rs/issues/3433
Versions:
* json2parquet 0.6.0 with the following Cargo packages:
  * parquet = "29.0.0"
  * arrow = "29.0.0"
  * arrow-schema = { version = "29.0.0", features = ["serde"] }
* PyArrow 10.0.1
* ClickHouse 22.13.1.1119
```bash
$ vi test.jsonl
```
```json
{"area": 123, "geom": "", "centroid_x": -86.86346599122807, "centroid_y": 34.751296108771925, "h3_7": "872649315ffffff", "h3_8": "882649315dfffff", "h3_9": "892649315cbffff"}
```
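For a non-interactive repro, the same one-record file can be written from a script instead of `vi`. This is a minimal sketch; the `RECORD` constant and the `write_test_jsonl` helper are hypothetical names introduced here for illustration, not part of json2parquet.

```python
import json

# The single record from the issue, copied from the JSON above.
RECORD = {
    "area": 123,
    "geom": "",
    "centroid_x": -86.86346599122807,
    "centroid_y": 34.751296108771925,
    "h3_7": "872649315ffffff",
    "h3_8": "882649315dfffff",
    "h3_9": "892649315cbffff",
}


def write_test_jsonl(path):
    """Write RECORD as one JSON object on a single line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(RECORD) + "\n")


if __name__ == "__main__":
    write_test_jsonl("test.jsonl")
```

JSON Lines expects each record on its own line, so the record is serialized with a single `json.dumps` call and no pretty-printing.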
```bash
$ json2parquet -c lz4 test.jsonl lz4.pq
$ ls -lth lz4.pq # 2.5K
```
```bash
$ hexdump -C lz4.pq | head; echo; hexdump -C lz4.pq | tail
```
```
00000000  50 41 52 31 15 00 15 1c  15 42 2c 15 02 15 00 15  |PAR1.....B,.....|
00000010  06 15 06 1c 58 08 7b 00  00 00 00 00 00 00 18 08  |....X.{.........|
00000020  7b 00 00 00 00 00 00 00  00 00 00 04 22 4d 18 44  |{..........."M.D|
00000030  40 5e 0e 00 00 80 02 00  00 00 02 01 7b 00 00 00  |@^..........{...|
00000040  00 00 00 00 00 00 00 00  e4 c0 1d d2 15 04 19 25  |...............%|
00000050  00 06 19 18 04 61 72 65  61 15 0a 16 02 16 6a 16  |.....area.....j.|
00000060  90 01 26 08 3c 58 08 7b  00 00 00 00 00 00 00 18  |..&.<X.{........|
00000070  08 7b 00 00 00 00 00 00  00 00 00 15 00 15 1c 15  |.{..............|
00000080  42 2c 15 02 15 00 15 06  15 06 1c 58 08 19 62 dc  |B,.........X..b.|
00000090  06 43 b7 55 c0 18 08 19  62 dc 06 43 b7 55 c0 00  |.C.U....b..C.U..|

00000970  41 42 51 41 45 41 41 4f  41 41 38 41 42 41 41 41  |ABQAEAAOAA8ABAAA|
00000980  41 41 67 41 45 41 41 41  41 42 67 41 41 41 41 67  |AAgAEAAAABgAAAAg|
00000990  41 41 41 41 41 41 41 42  41 68 77 41 41 41 41 49  |AAAAAAABAhwAAAAI|
000009a0  41 41 77 41 42 41 41 4c  41 41 67 41 41 41 42 41  |AAwABAALAAgAAABA|
000009b0  41 41 41 41 41 41 41 41  41 51 41 41 41 41 41 45  |AAAAAAAAAQAAAAAE|
000009c0  41 41 41 41 59 58 4a 6c  59 51 41 41 41 41 41 3d  |AAAAYXJlYQAAAAA=|
000009d0  00 18 19 70 61 72 71 75  65 74 2d 72 73 20 76 65  |...parquet-rs ve|
000009e0  72 73 69 6f 6e 20 32 33  2e 30 2e 30 00 fe 04 00  |rsion 23.0.0....|
000009f0  00 50 41 52 31                                    |.PAR1|
000009f5
```
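The bytes `04 22 4d 18` in the row at offset `0x20` of the dump above match the LZ4 frame format magic number (`0x184D2204`, little-endian). The sketch below only locates that byte sequence; it does not by itself establish why the readers reject the file. The helper name `find_lz4_frame_magic` is hypothetical and introduced here for illustration.

```python
# LZ4 frame format magic number 0x184D2204, as stored on disk (little-endian).
LZ4_FRAME_MAGIC = bytes([0x04, 0x22, 0x4D, 0x18])


def find_lz4_frame_magic(data: bytes) -> list:
    """Return every offset in `data` where the LZ4 frame magic starts."""
    offsets = []
    pos = data.find(LZ4_FRAME_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(LZ4_FRAME_MAGIC, pos + 1)
    return offsets


# The 16 bytes shown at offset 0x20 in the hexdump above.
row_0x20 = bytes.fromhex("7b 00 00 00 00 00 00 00 00 00 00 04 22 4d 18 44")
print([hex(0x20 + off) for off in find_lz4_frame_magic(row_0x20)])  # ['0x2b']
```

To scan the whole file rather than one row, pass `open("lz4.pq", "rb").read()` to the same function.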
```bash
$ ipython
```
```python
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('lz4.pq')
In [3]: pf
Out[3]: <pyarrow.parquet.core.ParquetFile at 0x10ca1cd90>
In [4]: pf.schema
Out[4]:
<pyarrow._parquet.ParquetSchema object at 0x10e74b280>
required group field_id=-1 arrow_schema {
optional int64 field_id=-1 area;
optional double field_id=-1 centroid_x;
optional double field_id=-1 centroid_y;
optional binary field_id=-1 geom (String);
optional binary field_id=-1 h3_7 (String);
optional binary field_id=-1 h3_8 (String);
optional binary field_id=-1 h3_9 (String);
}
In [6]: pf.read()
# OSError: Corrupt Lz4 compressed data.
```
```bash
$ clickhouse client
```
```sql
CREATE TABLE pq_test (
area Nullable(Int64),
centroid_x Nullable(Float64),
centroid_y Nullable(Float64),
geom Nullable(String),
h3_7 Nullable(String),
h3_8 Nullable(String),
h3_9 Nullable(String))
ENGINE = "Log";
```
```bash
$ clickhouse client \
--query='INSERT INTO pq_test FORMAT Parquet' \
< lz4.pq
```
```
Code: 33. DB::ParsingException: Error while reading Parquet data: IOError:
Corrupt Lz4 compressed data.: While executing ParquetBlockInputFormat: data for
INSERT was parsed from stdin: (in query: INSERT INTO pq_test FORMAT Parquet).
(CANNOT_READ_ALL_DATA)
```