Darren Weber created ARROW-5324:
-----------------------------------

             Summary: plasma API requests
                 Key: ARROW-5324
                 URL: https://issues.apache.org/jira/browse/ARROW-5324
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Darren Weber

Copied from [https://github.com/apache/arrow/issues/4318]

While working with the plasma API to create/seal an object for a table, using a custom object-ID, it would help to have a convenience API to get the size of the table. The following code might help to illustrate the request and the notes below:

```python
import pyarrow.parquet as pq
import pyarrow.plasma as plasma

# `size`, `parquet_path` and `plasma_path` are assumed to be defined earlier
if not parquet_path:
    parquet_path = f"./data/dataset_{size}.parquet"
if not plasma_path:
    plasma_path = f"./data/dataset_{size}.plasma"

try:
    plasma_client = plasma.connect(plasma_path)
except Exception:
    plasma_client = None

if plasma_client:
    # An object-ID must be exactly 20 bytes; truncating the path only works
    # when paths are at least 20 characters long and differ within that prefix
    table_id = plasma.ObjectID(bytes(parquet_path[:20], encoding='utf8'))
    try:
        table = plasma_client.get(table_id, timeout_ms=4000)
        # getattr avoids an AttributeError when a real table (no __name__) is returned
        if getattr(table, '__name__', None) == 'ObjectNotAvailable':
            raise ValueError('Failed to get plasma object')
    except ValueError:
        table = pq.read_table(parquet_path, use_threads=True)
        # Illustrates the request: pass the table directly, without computing its size first
        plasma_client.create_and_seal(table_id, table)
```

The use case is a workflow something like this:
 - process-A
   - generate a pandas DataFrame `df`
   - save the `df` to parquet, using pyarrow.parquet, with a unique parquet path
   - (this process will not save directly to plasma)
 - process-B
   - get the data from plasma or load it into plasma from the parquet file
   - use the unique parquet path to generate a unique object-ID

Notes:
 - `plasma_client.put` for the same data-table is not idempotent: it generates unique object-ID values that are not based on any hash of the data payload, so every put saves a new object-ID. Could it use a data hash for idempotent puts? e.g.
```python
In : plasma_client.put(table)
ObjectID(666625fcb60959d23b6bfc739f88816da29e04d6)

In : plasma_client.put(table)
ObjectID(d2a4662999db30177b090f9fc2bf6b28687d2f8d)

In : plasma_client.put(table)
ObjectID(b2928ad786de2fdb74d374055597f6e7bd97fd61)

In : hash(table)
TypeError: unhashable type: 'pyarrow.lib.Table'
```

 - In process-B, when the data is not already in plasma, it reads data from a parquet file into a pyarrow.Table and then needs an object-ID and the table size to use plasma `client.create_and_seal`, but it's not easy to get the table size
   - this might be related to github issue #2707
 - it might be ideal if `client.create_and_seal` accepted responsibility for the size of the object to be created when given a pyarrow data object like a table.
 - when the plasma store does not have the object, `client.get` could have a default timeout rather than hang indefinitely. It's also a bit clumsy to return an object that is not easily checked with `isinstance`; an exception handling pattern could be better (or something like the requests 404 patterns and options?)
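
As a possible workaround for the object-ID part of the workflow above, here is a minimal sketch that derives a repeatable 20-byte identifier from the parquet path by hashing it, rather than truncating it. The helper name `object_id_bytes` and the example paths are hypothetical, and this hashes the path, not the data payload, so it does not by itself make `put` idempotent.

```python
import hashlib

OBJECT_ID_LENGTH = 20  # the object-IDs shown above are 40 hex characters, i.e. 20 bytes


def object_id_bytes(key: str) -> bytes:
    """Derive a deterministic 20-byte identifier from a string key (hypothetical helper)."""
    # A SHA-1 digest is always exactly 20 bytes, so no padding or truncation is needed
    digest = hashlib.sha1(key.encode("utf8")).digest()
    assert len(digest) == OBJECT_ID_LENGTH
    return digest


# Two example paths that share their first 20 characters
path_a = "./data/dataset_10000.parquet"
path_b = "./data/dataset_100000.parquet"

# Truncation maps both paths to the same identifier...
print(path_a[:20] == path_b[:20])
# ...while hashing the full path keeps them distinct and repeatable
print(object_id_bytes(path_a) == object_id_bytes(path_b))
print(object_id_bytes(path_a) == object_id_bytes(path_a))
```

Assuming `plasma.ObjectID` accepts any 20-byte `bytes` value, as it does in the code above, the result could be passed as `plasma.ObjectID(object_id_bytes(parquet_path))` so that process-A and process-B agree on the ID without sharing any state beyond the path.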