[ 
https://issues.apache.org/jira/browse/ARROW-7063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967438#comment-16967438
 ] 

Joris Van den Bossche commented on ARROW-7063:
----------------------------------------------

I also ran into this recently when looking at reports involving a huge 
number of columns (although that was in Python, and I see that we don't use 
exactly the same code as the C++ pretty printer: 
https://github.com/apache/arrow/blob/e0cc9c43276840579a29332aca7348bbc415c051/python/pyarrow/types.pxi#L1245-L1264).

We should probably at least truncate the metadata. Personally I would prefer 
truncating it (so it doesn't get annoying) over not showing it at all, as IMO 
it is useful to see that the table has metadata.  
We could for example truncate each entry to a max of 50 characters (adding 
{{...}}) while still showing all entries (all keys).
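
A minimal sketch of that truncation idea in plain Python. This is not the actual pyarrow or C++ printer code; the names {{truncate_value}} and {{format_metadata}} are made up for illustration, and the 50-character limit is just the value suggested above:

```python
def truncate_value(value: bytes, max_chars: int = 50) -> str:
    """Decode a metadata value and keep at most max_chars characters."""
    text = value.decode("utf-8", errors="replace")
    if len(text) <= max_chars:
        return text
    # Mark the cut so the reader can tell the value was shortened.
    return text[:max_chars] + "..."


def format_metadata(metadata: dict, max_chars: int = 50) -> str:
    """Show every key, but only a bounded prefix of each value."""
    lines = ["-- metadata --"]
    for key, value in metadata.items():
        name = key.decode("utf-8", errors="replace")
        lines.append(f"{name}: {truncate_value(value, max_chars)}")
    return "\n".join(lines)


# Hypothetical metadata, shaped like what pyarrow stores (bytes -> bytes).
example = {
    b"pandas": b'{"index_columns": [{"kind": "range", "name": null, '
               b'"start": 0, "stop": 3, "step": 1}]}',
    b"short": b"abc",
}
print(format_metadata(example))
```

All keys stay visible, so you can still see that the schema carries metadata, but each value takes at most one short line.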

{quote}And IDK what to do with this {{ARROW:schema: }} business but it's 
clearly not readable as is.{quote}

It's the original Arrow schema in serialized (base64-encoded) form. Here is an 
example in Python of how it is created when writing a Parquet file, and how to 
retrieve it again ({{pd}} is pandas, imported earlier in the session):

{code}
In [33]: import pyarrow as pa

In [34]: table = pa.table(pd.DataFrame({'a': [1, 2, 3]}))

In [35]: table
Out[35]: 
pyarrow.Table
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
            b'.g157495696.dirty"}'}

In [36]: import pyarrow.parquet as pq

In [37]: pq.write_table(table, 'test.parquet')

In [39]: schema = pq.read_schema('test.parquet')

In [40]: schema
Out[40]: 
a: int64
metadata
--------
{b'ARROW:schema': b'/////4ACAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwA'
                  b'AAAEAAgACgAAAAgCAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAA'
                  b'EAAAAAYAAABwYW5kYXMAANMBAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJr'
                  b'aW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAi'
                  b'c3RvcCI6IDMsICJzdGVwIjogMX1dLCAiY29sdW1uX2luZGV4ZXMiOiBb'
                  b'eyJuYW1lIjogbnVsbCwgImZpZWxkX25hbWUiOiBudWxsLCAicGFuZGFz'
                  b'X3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIs'
                  b'ICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifX1dLCAiY29s'
                  b'dW1ucyI6IFt7Im5hbWUiOiAiYSIsICJmaWVsZF9uYW1lIjogImEiLCAi'
                  b'cGFuZGFzX3R5cGUiOiAiaW50NjQiLCAibnVtcHlfdHlwZSI6ICJpbnQ2'
                  b'NCIsICJtZXRhZGF0YSI6IG51bGx9XSwgImNyZWF0b3IiOiB7ImxpYnJh'
                  b'cnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjAuMTUuMS5kZXYyMTIr'
                  b'ZzRhZmU5ZjBlYSJ9LCAicGFuZGFzX3ZlcnNpb24iOiAiMC4yNi4wLmRl'
                  b'djArNjkxLmcxNTc0OTU2OTYuZGlydHkifQABAAAAFAAAABAAFAAIAAYA'
                  b'BwAMAAAAEAAQAAAAAAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgA'
                  b'AAAAAAABQAAAAAEAAABhAAAA',
 b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
            b'.g157495696.dirty"}'}

In [44]: original_schema_encoded = schema.metadata[b'ARROW:schema']

In [45]: import base64

In [46]: original_schema = pa.read_schema(pa.BufferReader(base64.b64decode(original_schema_encoded)))

In [47]: original_schema
Out[47]: 
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev212+g4afe9f0ea"}, "pandas_version": "0.26.0.dev0+691'
            b'.g157495696.dirty"}'}

{code}
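
To see that the {{ARROW:schema}} value is a binary message rather than text, you can decode just its first few characters with the standard library. This is a rough sketch: it assumes the value uses the Arrow IPC encapsulated message framing (a 4-byte continuation marker followed by a little-endian int32 metadata length), and the variable names are only illustrative:

```python
import base64
import struct

# First 16 base64 characters of the ARROW:schema value shown above.
prefix = b"/////4ACAAAQAAAA"
raw = base64.b64decode(prefix)  # 16 base64 chars -> 12 bytes

# Assumed layout: uint32 continuation marker, then int32 metadata length,
# both little-endian. The remaining bytes start the Flatbuffers payload.
continuation, metadata_length = struct.unpack("<Ii", raw[:8])
print(hex(continuation), metadata_length)  # prints: 0xffffffff 640
```

So the leading {{/////}} is just the base64 rendering of the {{0xFFFFFFFF}} marker bytes, which is why the value is unreadable when printed as is.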

> [C++] Schema print method prints too much metadata
> --------------------------------------------------
>
>                 Key: ARROW-7063
>                 URL: https://issues.apache.org/jira/browse/ARROW-7063
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, C++ - Dataset
>            Reporter: Neal Richardson
>            Priority: Minor
>              Labels: dataset, parquet
>             Fix For: 1.0.0
>
>
> I loaded some taxi data in a Dataset and printed the schema. This is what was 
> printed:
> {code}
> vendor_id: string
> pickup_at: timestamp[us]
> dropoff_at: timestamp[us]
> passenger_count: int8
> trip_distance: float
> pickup_longitude: float
> pickup_latitude: float
> rate_code_id: null
> store_and_fwd_flag: string
> dropoff_longitude: float
> dropoff_latitude: float
> payment_type: string
> fare_amount: float
> extra: float
> mta_tax: float
> tip_amount: float
> tolls_amount: float
> total_amount: float
> -- metadata --
> pandas: {"index_columns": [{"kind": "range", "name": null, "start": 0, 
> "stop": 14387371, "step": 1}], "column_indexes": [{"name": null, 
> "field_name": null, "pandas_type": "unicode", "numpy_type": "object", 
> "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "vendor_id", 
> "field_name": "vendor_id", "pandas_type": "unicode", "numpy_type": "object", 
> "metadata": null}, {"name": "pickup_at", "field_name": "pickup_at", 
> "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, 
> {"name": "dropoff_at", "field_name": "dropoff_at", "pandas_type": "datetime", 
> "numpy_type": "datetime64[ns]", "metadata": null}, {"name": 
> "passenger_count", "field_name": "passenger_count", "pandas_type": "int8", 
> "numpy_type": "int8", "metadata": null}, {"name": "trip_distance", 
> "field_name": "trip_distance", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "pickup_longitude", "field_name": 
> "pickup_longitude", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "pickup_latitude", "field_name": 
> "pickup_latitude", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "rate_code_id", "field_name": "rate_code_id", 
> "pandas_type": "empty", "numpy_type": "object", "metadata": null}, {"name": 
> "store_and_fwd_flag", "field_name": "store_and_fwd_flag", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null}, {"name": 
> "dropoff_longitude", "field_name": "dropoff_longitude", "pandas_type": 
> "float32", "numpy_type": "float32", "metadata": null}, {"name": 
> "dropoff_latitude", "field_name": "dropoff_latitude", "pandas_type": 
> "float32", "numpy_type": "float32", "metadata": null}, {"name": 
> "payment_type", "field_name": "payment_type", "pandas_type": "unicode", 
> "numpy_type": "object", "metadata": null}, {"name": "fare_amount", 
> "field_name": "fare_amount", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "extra", "field_name": "extra", 
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null}, 
> {"name": "mta_tax", "field_name": "mta_tax", "pandas_type": "float32", 
> "numpy_type": "float32", "metadata": null}, {"name": "tip_amount", 
> "field_name": "tip_amount", "pandas_type": "float32", "numpy_type": 
> "float32", "metadata": null}, {"name": "tolls_amount", "field_name": 
> "tolls_amount", "pandas_type": "float32", "numpy_type": "float32", 
> "metadata": null}, {"name": "total_amount", "field_name": "total_amount", 
> "pandas_type": "float32", "numpy_type": "float32", "metadata": null}], 
> "creator": {"library": "pyarrow", "version": "0.15.1"}, "pandas_version": 
> "0.25.3"}
> ARROW:schema: 
> /////3gOAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAAAEAAgACgAAAFQKAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAsCgAABAAAAB8KAAB7ImluZGV4X2NvbHVtbnMiOiBbeyJraW5kIjogInJhbmdlIiwgIm5hbWUiOiBudWxsLCAic3RhcnQiOiAwLCAic3RvcCI6IDE0Mzg3MzcxLCAic3RlcCI6IDF9XSwgImNvbHVtbl9pbmRleGVzIjogW3sibmFtZSI6IG51bGwsICJmaWVsZF9uYW1lIjogbnVsbCwgInBhbmRhc190eXBlIjogInVuaWNvZGUiLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAibWV0YWRhdGEiOiB7ImVuY29kaW5nIjogIlVURi04In19XSwgImNvbHVtbnMiOiBbeyJuYW1lIjogInZlbmRvcl9pZCIsICJmaWVsZF9uYW1lIjogInZlbmRvcl9pZCIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJwaWNrdXBfYXQiLCAiZmllbGRfbmFtZSI6ICJwaWNrdXBfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9hdCIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfYXQiLCAicGFuZGFzX3R5cGUiOiAiZGF0ZXRpbWUiLCAibnVtcHlfdHlwZSI6ICJkYXRldGltZTY0W25zXSIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicGFzc2VuZ2VyX2NvdW50IiwgImZpZWxkX25hbWUiOiAicGFzc2VuZ2VyX2NvdW50IiwgInBhbmRhc190eXBlIjogImludDgiLCAibnVtcHlfdHlwZSI6ICJpbnQ4IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJ0cmlwX2Rpc3RhbmNlIiwgImZpZWxkX25hbWUiOiAidHJpcF9kaXN0YW5jZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicGlja3VwX2xvbmdpdHVkZSIsICJmaWVsZF9uYW1lIjogInBpY2t1cF9sb25naXR1ZGUiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInBpY2t1cF9sYXRpdHVkZSIsICJmaWVsZF9uYW1lIjogInBpY2t1cF9sYXRpdHVkZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAicmF0ZV9jb2RlX2lkIiwgImZpZWxkX25hbWUiOiAicmF0ZV9jb2RlX2lkIiwgInBhbmRhc190eXBlIjogImVtcHR5IiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAiZmllbGRfbmFtZSI6ICJzdG9yZV9hbmRfZndkX2ZsYWciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSIsICJudW1weV90eXBlIjogIm9iamVjdCIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZH
JvcG9mZl9sb25naXR1ZGUiLCAiZmllbGRfbmFtZSI6ICJkcm9wb2ZmX2xvbmdpdHVkZSIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAiZHJvcG9mZl9sYXRpdHVkZSIsICJmaWVsZF9uYW1lIjogImRyb3BvZmZfbGF0aXR1ZGUiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInBheW1lbnRfdHlwZSIsICJmaWVsZF9uYW1lIjogInBheW1lbnRfdHlwZSIsICJwYW5kYXNfdHlwZSI6ICJ1bmljb2RlIiwgIm51bXB5X3R5cGUiOiAib2JqZWN0IiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJmYXJlX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogImZhcmVfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJleHRyYSIsICJmaWVsZF9uYW1lIjogImV4dHJhIiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH0sIHsibmFtZSI6ICJtdGFfdGF4IiwgImZpZWxkX25hbWUiOiAibXRhX3RheCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidGlwX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogInRpcF9hbW91bnQiLCAicGFuZGFzX3R5cGUiOiAiZmxvYXQzMiIsICJudW1weV90eXBlIjogImZsb2F0MzIiLCAibWV0YWRhdGEiOiBudWxsfSwgeyJuYW1lIjogInRvbGxzX2Ftb3VudCIsICJmaWVsZF9uYW1lIjogInRvbGxzX2Ftb3VudCIsICJwYW5kYXNfdHlwZSI6ICJmbG9hdDMyIiwgIm51bXB5X3R5cGUiOiAiZmxvYXQzMiIsICJtZXRhZGF0YSI6IG51bGx9LCB7Im5hbWUiOiAidG90YWxfYW1vdW50IiwgImZpZWxkX25hbWUiOiAidG90YWxfYW1vdW50IiwgInBhbmRhc190eXBlIjogImZsb2F0MzIiLCAibnVtcHlfdHlwZSI6ICJmbG9hdDMyIiwgIm1ldGFkYXRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMC4xNS4xIn0sICJwYW5kYXNfdmVyc2lvbiI6ICIwLjI1LjMifQAGAAAAcGFuZGFzAAASAAAAxAMAAHgDAABEAwAAAAMAAMgCAACMAgAAVAIAACACAADoAQAArAEAAHABAAA8AQAACAEAANgAAACoAAAAdAAAADwAAAAEAAAAlPz//wAAAQMYAAAADAAAAAQAAAAAAAAAyvz//wAAAQAMAAAAdG90YWxfYW1vdW50AAAAAMj8//8AAAEDGAAAAAwAAAAEAAAAAAAAAP78//8AAAEADAAAAHRvbGxzX2Ftb3VudAAAAAD8/P//AAABAxgAAAAMAAAABAAAAAAAAAAy/f//AAABAAoAAAB0aXBfYW1vdW50AAAs/f//AAABAxgAAAAMAAAABAAAAAAAAABi/f//AAABAAcAAABtdGFfdGF4AFj9//8AAAEDGAAAAAwAAAAEAAAAAAAAAI79//8AAA
EABQAAAGV4dHJhAAAAhP3//wAAAQMYAAAADAAAAAQAAAAAAAAAuv3//wAAAQALAAAAZmFyZV9hbW91bnQAtP3//wAAAQUUAAAADAAAAAQAAAAAAAAApP3//wwAAABwYXltZW50X3R5cGUAAAAA5P3//wAAAQMYAAAADAAAAAQAAAAAAAAAGv7//wAAAQAQAAAAZHJvcG9mZl9sYXRpdHVkZQAAAAAc/v//AAABAxgAAAAMAAAABAAAAAAAAABS/v//AAABABEAAABkcm9wb2ZmX2xvbmdpdHVkZQAAAFT+//8AAAEFFAAAAAwAAAAEAAAAAAAAAET+//8SAAAAc3RvcmVfYW5kX2Z3ZF9mbGFnAACI/v//AAABARQAAAAMAAAABAAAAAAAAAB4/v//DAAAAHJhdGVfY29kZV9pZAAAAAC4/v//AAABAxgAAAAMAAAABAAAAAAAAADu/v//AAABAA8AAABwaWNrdXBfbGF0aXR1ZGUA7P7//wAAAQMYAAAADAAAAAQAAAAAAAAAIv///wAAAQAQAAAAcGlja3VwX2xvbmdpdHVkZQAAAAAk////AAABAxgAAAAMAAAABAAAAAAAAABa////AAABAA0AAAB0cmlwX2Rpc3RhbmNlAAAAWP///wAAAQIkAAAAFAAAAAQAAAAAAAAACAAMAAgABwAIAAAAAAAAAQgAAAAPAAAAcGFzc2VuZ2VyX2NvdW50AJj///8AAAEKGAAAAAwAAAAEAAAAAAAAAM7///8AAAMACgAAAGRyb3BvZmZfYXQAAMj///8AAAEKIAAAABQAAAAEAAAAAAAAAAAABgAIAAYABgAAAAAAAwAJAAAAcGlja3VwX2F0AAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEFGAAAABAAAAAEAAAAAAAAAAQABAAEAAAACQAAAHZlbmRvcl9pZAAAAA==
> {code}
> I'd argue that extra metadata, if it's not part of the Arrow format and can 
> be whatever an application wants to put in there, should not be printed as 
> part of the schema's ToString method. It should be viewable some way, just 
> not always. And IDK what to do with this {{ARROW:schema: }} business but it's 
> clearly not readable as is.



