[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #7474: ARROW-8802: [C++][Dataset] Preserve dataset schema's metadata on column projection

GitBox Thu, 18 Jun 2020 02:44:05 -0700


jorisvandenbossche commented on a change in pull request #7474:
URL: https://github.com/apache/arrow/pull/7474#discussion_r442102463




##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -1566,3 +1566,21 @@ def test_parquet_dataset_factory_partitioned(tempdir):
     result = result.to_pandas().sort_values("f1").reset_index(drop=True)
     expected = table.to_pandas().drop(columns=["part"])
     pd.testing.assert_frame_equal(result, expected)
+
+
+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_dataset_schema_metadata(tempdir):
+    # ARROW-8802
+    df = pd.DataFrame({'a': [1, 2, 3]})
+    path = tempdir / "test.parquet"
+    df.to_parquet(path)
+    dataset = ds.dataset(path)
+
+    schema = dataset.to_table().schema
+    projected_schema = dataset.to_table(columns=["a"]).schema
+
+    # ensure the pandas metadata is included in the schema
+    assert b"pandas" in schema.metadata

Review comment:
Added an additional assert to ensure the "pandas" key is actually present.
If for some reason we accidentally remove it in both cases, this test won't
detect that, because both schemas would still be identical (both missing the
metadata).
(Although I assume we have other tests that would start failing then.)
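
To illustrate the point, here is a minimal sketch of why the equality check alone is not enough. It does not use pyarrow: plain dicts (or None) stand in for `schema.metadata`, and the two helper functions are hypothetical names made up for this example.

```python
# Plain dicts (or None) stand in for pyarrow's `schema.metadata`;
# this is an illustration of the reasoning, not pyarrow code.


def equality_only(full_meta, projected_meta):
    """Weak check: only compares the two metadata mappings."""
    return full_meta == projected_meta


def presence_and_equality(full_meta, projected_meta):
    """Stronger check: the b"pandas" key must exist, then both must match."""
    has_pandas = full_meta is not None and b"pandas" in full_meta
    return has_pandas and full_meta == projected_meta


# Placeholder metadata; the value is a dummy, not real pandas metadata.
good = {b"pandas": b"{}"}

# Metadata survives the projection: both checks pass.
assert equality_only(good, good)
assert presence_and_equality(good, good)

# Metadata dropped only on projection (the bug this PR fixes): both checks fail.
assert not equality_only(good, None)
assert not presence_and_equality(good, None)

# Metadata dropped in BOTH cases: the weak check still passes,
# only the presence check catches the regression.
assert equality_only(None, None)
assert not presence_and_equality(None, None)
```

The last case is the one the extra assert guards against: two schemas that are equal only because both lost their metadata.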




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
