jorisvandenbossche commented on code in PR #39112:
URL: https://github.com/apache/arrow/pull/39112#discussion_r1418573854


##########
docs/source/python/parquet.rst:
##########
@@ -511,36 +511,20 @@ from a remote filesystem into a pandas dataframe you may need to run
 ``sort_index`` to maintain row ordering (as long as the ``preserve_index``
 option was enabled on write).
 
-.. note::
-
-   The ParquetDataset is being reimplemented based on the new generic Dataset
-   API (see the :ref:`dataset` docs for an overview). This is not yet the
-   default, but can already be enabled by passing the ``use_legacy_dataset=False``
-   keyword to :class:`ParquetDataset` or :func:`read_table`::
-
-      pq.ParquetDataset('dataset_name/', use_legacy_dataset=False)
-
-   Enabling this gives the following new features:
-
-   - Filtering on all columns (using row group statistics) instead of only on
-     the partition keys.
-   - More fine-grained partitioning: support for a directory partitioning scheme
-     in addition to the Hive-like partitioning (e.g. "/2019/11/15/" instead of
-     "/year=2019/month=11/day=15/"), and the ability to specify a schema for
-     the partition keys.
-   - General performance improvement and bug fixes.
+Other features:
 
-   It also has the following changes in behaviour:
+- Filtering on all columns (using row group statistics) instead of only on
+   the partition keys.

Review Comment:
   Formatting nit: reStructuredText requires aligned indentation (same for below)
   
   ```suggestion
   - Filtering on all columns (using row group statistics) instead of only on
     the partition keys.
   ```



##########
python/pyarrow/parquet/core.py:
##########
@@ -3234,13 +2170,12 @@ def _mkdir_if_not_exists(fs, path):
 
 def write_to_dataset(table, root_path, partition_cols=None,
                      partition_filename_cb=None, filesystem=None,
-                     use_legacy_dataset=None, schema=None,

Review Comment:
   same here



##########
python/pyarrow/tests/parquet/test_basic.py:
##########
@@ -63,21 +62,20 @@ def test_parquet_invalid_version(tempdir):
                      data_page_version="2.2")
 
 
-@parametrize_legacy_dataset
-def test_set_data_page_size(use_legacy_dataset):
+@pytest.mark.dataset

Review Comment:
   I know this mark was in practice used through `@parametrize_legacy_dataset`, but we might not need to add it for all cases. Here, this `_check_roundtrip` is using `pq.read_table` under the hood. But for many cases (reading simple files), we have a fallback in `read_table` to use ParquetFile in case the dataset API is not available.
   
   So I think we only need to mark those tests with `@pytest.mark.dataset` if it is actually testing a feature that is only supported in the dataset-based code path (e.g. reading multiple files instead of a single file, or when specifying a filter or partitioning).
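   
   A rough sketch of that distinction, with hypothetical test names and bodies (not taken from the suite; `tempdir` is assumed to be the usual path fixture):
   
   ```python
   import pytest
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   
   def test_single_file_roundtrip(tempdir):
       # A single-file read can fall back to ParquetFile when the dataset
       # module is not available, so no dataset mark is needed here.
       table = pa.table({"a": [1, 2, 3]})
       pq.write_table(table, tempdir / "single.parquet")
       assert pq.read_table(tempdir / "single.parquet").equals(table)
   
   
   @pytest.mark.dataset
   def test_read_directory_of_files(tempdir):
       # Reading a directory of multiple files only works through the
       # dataset-based code path, so this test keeps the mark.
       table = pa.table({"a": [1, 2, 3]})
       pq.write_table(table, tempdir / "part-0.parquet")
       pq.write_table(table, tempdir / "part-1.parquet")
       assert pq.read_table(tempdir).num_rows == 6
   ```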



##########
python/pyarrow/parquet/core.py:
##########
@@ -2420,18 +1380,80 @@ class _ParquetDatasetV2:
     create a ParquetDataset object with filter:
 
     >>> dataset = pq.ParquetDataset('dataset_v2/',
-    ...                             filters=[('n_legs','=',4)],
-    ...                             use_legacy_dataset=False)
+    ...                             filters=[('n_legs','=',4)])
     >>> dataset.read().to_pandas()
        n_legs animal  year
     0       4    Dog  2021
     1       4  Horse  2022
-    """
+"""
+
+
+class ParquetDataset:
+    __doc__ = """
+    Encapsulates details of reading a complete Parquet dataset possibly
+    consisting of multiple files and partitions in subdirectories.
+
+    Parameters
+    ----------
+    path_or_paths : str or List[str]
+        A directory name, single file name, or list of file names.
+    filesystem : FileSystem, default None
+        If nothing passed, will be inferred based on path.
+        Path will try to be found in the local on-disk filesystem otherwise
+        it will be parsed as an URI to determine the filesystem.
+    schema : pyarrow.parquet.Schema
+        Optionally provide the Schema for the Dataset, in which case it will
+        not be inferred from the source.
+    filters : pyarrow.compute.Expression or List[Tuple] or List[List[Tuple]],
+    default None
+        Rows which do not match the filter predicate will be removed from scanned
+        data. Partition keys embedded in a nested directory structure will be
+        exploited to avoid loading files at all if they contain no matching rows.
+        Within-file level filtering and different partitioning schemes are supported.
+
+        {1}
+    {0}
+    ignore_prefixes : list, optional
+        Files matching any of these prefixes will be ignored by the
+        discovery process.
+        This is matched to the basename of a path.
+        By default this is ['.', '_'].
+        Note that discovery happens only if a directory is passed as source.
+    pre_buffer : bool, default True
+        Coalesce and issue file reads in parallel to improve performance on
+        high-latency filesystems (e.g. S3, GCS). If True, Arrow will use a
+        background I/O thread pool. If using a filesystem layer that itself
+        performs readahead (e.g. fsspec's S3FS), disable readahead for best
+        results. Set to False if you want to prioritize minimal memory usage
+        over maximum speed.
+    coerce_int96_timestamp_unit : str, default None
+        Cast timestamps that are stored in INT96 format to a particular resolution
+        (e.g. 'ms'). Setting to None is equivalent to 'ns' and therefore INT96
+        timestamps will be inferred as timestamps in nanoseconds.
+    decryption_properties : FileDecryptionProperties or None
+        File-level decryption properties.
+        The decryption properties can be created using
+        ``CryptoFactory.file_decryption_properties()``.
+    thrift_string_size_limit : int, default None
+        If not None, override the maximum total string size allocated
+        when decoding Thrift structures. The default limit should be
+        sufficient for most Parquet files.
+    thrift_container_size_limit : int, default None
+        If not None, override the maximum total size of containers allocated
+        when decoding Thrift structures. The default limit should be
+        sufficient for most Parquet files.
+    page_checksum_verification : bool, default False
+        If True, verify the page checksum for each page read from the file.
+
+    Examples
+    --------
+    {2}
+    """.format(_read_docstring_common, _DNF_filter_doc, 
_parquet_dataset_example)
 
-    def __init__(self, path_or_paths, filesystem=None, *, filters=None,
+    def __init__(self, path_or_paths, filesystem=None, schema=None, *, filters=None,
                  partitioning="hive", read_dictionary=None, buffer_size=None,
                  memory_map=False, ignore_prefixes=None, pre_buffer=True,
-                 coerce_int96_timestamp_unit=None, schema=None,
+                 coerce_int96_timestamp_unit=None,
                  decryption_properties=None, thrift_string_size_limit=None,
                  thrift_container_size_limit=None,
                  page_checksum_verification=False,

Review Comment:
   We might need to keep the `use_legacy_dataset` keyword here for a bit longer, but just ignore it / raise a deprecation warning when it is specified by the user
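   
   For illustration, a minimal sketch of one way to do that (the warning text and the `None` sentinel are assumptions, not the final API):
   
   ```python
   import warnings
   
   
   class ParquetDataset:
       def __init__(self, path_or_paths, filesystem=None, schema=None, *,
                    filters=None, use_legacy_dataset=None, **kwargs):
           if use_legacy_dataset is not None:
               # Accept the keyword but ignore it, and tell callers to stop
               # passing it; actual removal can follow in a later release.
               warnings.warn(
                   "Passing 'use_legacy_dataset' is deprecated as it is now "
                   "ignored; stop passing this argument to silence this warning.",
                   FutureWarning, stacklevel=2)
           ...
   ```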



##########
python/pyarrow/parquet/core.py:
##########
@@ -2961,129 +1949,77 @@ def partitioning(self):
 """
 
 
-def read_table(source, *, columns=None, use_threads=True, metadata=None,
+def read_table(source, *, columns=None, use_threads=True,
                schema=None, use_pandas_metadata=False, read_dictionary=None,
                memory_map=False, buffer_size=0, partitioning="hive",
-               filesystem=None, filters=None, use_legacy_dataset=False,

Review Comment:
   same here about temporarily keeping the keyword around (e.g. if a user was now explicitly specifying `use_legacy_dataset=False` themselves in their code, that shouldn't break)
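   
   For example, hypothetical user code like the following should keep working during the deprecation period, emitting a FutureWarning rather than raising a TypeError (the path is a placeholder):
   
   ```python
   import pyarrow.parquet as pq
   
   # Opting in to the new code path used to require this keyword; after this
   # change it should still be accepted (and ignored) with a warning.
   table = pq.read_table("dataset_name/", use_legacy_dataset=False)
   ```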


