[
https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967438#comment-16967438
]
Joris Van den Bossche commented on ARROW-7063:
----------------------------------------------
I also ran into this recently when looking at the reports involving a huge
number of columns (although that was in Python, and I see that we don't use the
exact same code as the C++ pretty printer:
https://github.com/apache/arrow/blob/e0cc9c43276840579a29332aca7348bbc415c051/python/pyarrow/types.pxi#L1245-L1264).
We should probably at least truncate the metadata. Personally I would prefer
truncating them (so they don't get annoying) instead of not showing them at
all, as IMO it is useful to see that the table has metadata.
We could for example truncate each entry to a max of 50 characters (adding
{{...}}) while still showing all entries (all keys).
{quote}And IDK what to do with this {{ARROW:schema: }} business but it's
clearly not readable as is.{quote}
It's a the original arrow schema in serialized format. Example with python how
it is created when writing a parquet file, and retrieving it again:
{code}
In [33]: import pyarrow as pa
In [34]: table = pa.table(pd.DataFrame({'a': [1, 2, 3]}))
In [35]: table
Out[35]:
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
b'.g157495696.dirty"}'}
In [36]: import pyarrow.parquet as pq
In [37]: pq.write_table(table, 'test.parquet')
In [39]: schema = pq.read_schema('test.parquet')
In [40]: schema
Out[40]:
a: int64
metadata
--------
{b'ARROW:schema': b'/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwA'
b'AAAEAAgACgAAAAgCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAA'
b'EAAAAAYAAABwYW5kYXMAANMBAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJr'
b'aW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAi'
b'c3RvcCI6IDMsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBb'
b'eyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFz'
b'X3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIs'
b'ICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29s'
b'dW1ucyI6IFt7Im5hbWUiOiAiYSIsICJmaWVsZF9uYW1lIjogImEiLCAi'
b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'
b'NCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJh'
b'cnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjAuMTUuMS5kZXYyMTIr'
b'ZzRhZmU5ZjBlYSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMC4yNi4wLmRl'
b'djArNjkxLmcxNTc0OTU2OTYuZGlydHkifQABAAAAFAAAABAAFAAIAAYA'
b'BwAMAAAAEAAQAAAAAAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgA'
b'AAAAAAABQAAAAAEAAABhAAAA',
b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
b'.g157495696.dirty"}'}
In [44]: original_schema_encoded = schema.metadata[b'ARROW:schema']
In [45]: import base64
In [46]: original_schema =
pa.read_schema(pa.BufferReader(base64.b64decode(original_schema_encoded)))
In [47]: original_schema
Out[47]:
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
b'.g157495696.dirty"}'}
{code}
> [C++] Schema print method prints too much metadata
> --------------------------------------------------
>
> Key: ARROW-7063
> URL: https://issues.apache.org/jira/browse/ARROW-7063
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Dataset
> Reporter: Neal Richardson
> Priority: Minor
> Labels: dataset, parquet
> Fix For: 1.0.0
>
>
> I loaded some taxi data in a Dataset and printed the schema. This is what was
> printed:
> {code}
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> pickup_longitude: float
> pickup_latitude: float
> rate_code_id: null
> store_and_fwd_flag: string
> dropoff_longitude: float
> dropoff_latitude: float
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> total_amount: float
> -- metadata --
> pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0,
> "stop": 14387371, "step": 1}], "column_indexes": [{"name": null,
> "field_name": null, "pandas_type": "unicode", "numpy_type": "object",
> "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id",
> "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object",
> "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at",
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null},
> {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime",
> "numpy_type": "datetime64[ns]", "metadata": null}, {"name":
> "passenger_count", "field_name": "passenger_count", "pandas_type": "int8",
> "numpy_type": "int8", "metadata": null}, {"name": "trip_distance",
> "field_name": "trip_distance", "pandas_type": "float32", "numpy_type":
> "float32", "metadata": null}, {"name": "pickup_longitude", "field_name":
> "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32",
> "metadata": null}, {"name": "pickup_latitude", "field_name":
> "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32",
> "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id",
> "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name":
> "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type":
> "unicode", "numpy_type": "object", "metadata": null}, {"name":
> "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type":
> "float32", "numpy_type": "float32", "metadata": null}, {"name":
> "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type":
> "float32", "numpy_type": "float32", "metadata": null}, {"name":
> "payment_type", "field_name": "payment_type", "pandas_type": "unicode",
> "numpy_type": "object", "metadata": null}, {"name": "fare_amount",
> "field_name": "fare_amount", "pandas_type": "float32", "numpy_type":
> "float32", "metadata": null}, {"name": "extra", "field_name": "extra",
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null},
> {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32",
> "numpy_type": "float32", "metadata": null}, {"name": "tip_amount",
> "field_name": "tip_amount", "pandas_type": "float32", "numpy_type":
> "float32", "metadata": null}, {"name": "tolls_amount", "field_name":
> "tolls_amount", "pandas_type": "float32", "numpy_type": "float32",
> "metadata": null}, {"name": "total_amount", "field_name": "total_amount",
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null}],
> "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version":
> "0.25.3"}
> ARROW:schema:
> /////3gOAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAEAAgACgAAAFQKAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAsCgAABAAAAB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9hdCIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicGFzc2VuZ2VyX2NvdW50IiwgImZpZWxkX25hbWUiOiAicGFzc2VuZ2VyX2NvdW50IiwgInBhbmRhc190eXBlIjogImludDgiLCAibnVtcHlfdHlwZSI6ICJpbnQ4IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJ0cmlwX2Rpc3RhbmNlIiwgImZpZWxkX25hbWUiOiAidHJpcF9kaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicGlja3VwX2xvbmdpdHVkZSIsICJmaWVsZF9uYW1lIjogInBpY2t1cF9sb25naXR1ZGUiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInBpY2t1cF9sYXRpdHVkZSIsICJmaWVsZF9uYW1lIjogInBpY2t1cF9sYXRpdHVkZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicmF0ZV9jb2RlX2lkIiwgImZpZWxkX25hbWUiOiAicmF0ZV9jb2RlX2lkIiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAiZmllbGRfbmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9sb25naXR1ZGUiLCAiZmllbGRfbmFtZSI6ICJkcm9wb2ZmX2xvbmdpdHVkZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9sYXRpdHVkZSIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfbGF0aXR1ZGUiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInBheW1lbnRfdHlwZSIsICJmaWVsZF9uYW1lIjogInBheW1lbnRfdHlwZSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJmYXJlX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogImZhcmVfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJleHRyYSIsICJmaWVsZF9uYW1lIjogImV4dHJhIiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJtdGFfdGF4IiwgImZpZWxkX25hbWUiOiAibXRhX3RheCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidGlwX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogInRpcF9hbW91bnQiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInRvbGxzX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogInRvbGxzX2Ftb3VudCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidG90YWxfYW1vdW50IiwgImZpZWxkX25hbWUiOiAidG90YWxfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIwLjI1LjMifQAGAAAAcGFuZGFzAAASAAAAxAMAAHgDAABEAwAAAAMAAMgCAACMAgAAVAIAACACAADoAQAArAEAAHABAAA8AQAACAEAANgAAACoAAAAdAAAADwAAAAEAAAAlPz//wAAAQMYAAAADAAAAAQAAAAAAAAAyvz//wAAAQAMAAAAdG90YWxfYW1vdW50AAAAAMj8//8AAAEDGAAAAAwAAAAEAAAAAAAAAP78//8AAAEADAAAAHRvbGxzX2Ftb3VudAAAAAD8/P//AAABAxgAAAAMAAAABAAAAAAAAAAy/f//AAABAAoAAAB0aXBfYW1vdW50AAAs/f//AAABAxgAAAAMAAAABAAAAAAAAABi/f//AAABAAcAAABtdGFfdGF4AFj9//8AAAEDGAAAAAwAAAAEAAAAAAAAAI79//8AAAEABQAAAGV4dHJhAAAAhP3//wAAAQMYAAAADAAAAAQAAAAAAAAAuv3//wAAAQALAAAAZmFyZV9hbW91bnQAtP3//wAAAQUUAAAADAAAAAQAAAAAAAAApP3//wwAAABwYXltZW50X3R5cGUAAAAA5P3//wAAAQMYAAAADAAAAAQAAAAAAAAAGv7//wAAAQAQAAAAZHJvcG9mZl9sYXRpdHVkZQAAAAAc/v//AAABAxgAAAAMAAAABAAAAAAAAABS/v//AAABABEAAABkcm9wb2ZmX2xvbmdpdHVkZQAAAFT+//8AAAEFFAAAAAwAAAAEAAAAAAAAAET+//8SAAAAc3RvcmVfYW5kX2Z3ZF9mbGFnAACI/v//AAABARQAAAAMAAAABAAAAAAAAAB4/v//DAAAAHJhdGVfY29kZV9pZAAAAAC4/v//AAABAxgAAAAMAAAABAAAAAAAAADu/v//AAABAA8AAABwaWNrdXBfbGF0aXR1ZGUA7P7//wAAAQMYAAAADAAAAAQAAAAAAAAAIv///wAAAQAQAAAAcGlja3VwX2xvbmdpdHVkZQAAAAAk////AAABAxgAAAAMAAAABAAAAAAAAABa////AAABAA0AAAB0cmlwX2Rpc3RhbmNlAAAAWP///wAAAQIkAAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAQgAAAAPAAAAcGFzc2VuZ2VyX2NvdW50AJj///8AAAEKGAAAAAwAAAAEAAAAAAAAAM7///8AAAMACgAAAGRyb3BvZmZfYXQAAMj///8AAAEKIAAAABQAAAAEAAAAAAAAAAAABgAIAAYABgAAAAAAAwAJAAAAcGlja3VwX2F0AAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAACQAAAHZlbmRvcl9pZAAAAA==
> {code}
> I'd argue that extra metadata, if it's not part of the Arrow format and can
> be whatever an application wants to put in there, should not be printed as
> part of the schema's ToString method. It should be viewable some way, just
> not always. And IDK what to do with this {{ARROW:schema: }} business but it's
> clearly not readable as is.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)