[
https://issues.apache.org/jira/browse/ARROW-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840569#comment-16840569
]
Darren Weber edited comment on ARROW-5324 at 5/15/19 5:06 PM:
--------------------------------------------------------------
An attempt to save unique object-IDs in a JSON file ran into a
serialization exception:
{noformat}
TypeError: Object of type ObjectID is not JSON serializable
{noformat}
{code:python}
import json
from pathlib import Path

import pyarrow.parquet as pq

# plasma_client and parquet_path are set up as in the issue description.
plasma_objects_file = Path("./data/plasma_objects.json")
if plasma_objects_file.exists():
    plasma_objects = json.loads(plasma_objects_file.read_text())
else:
    plasma_objects = {}
try:
    table_id = plasma_objects[parquet_path]
    table = plasma_client.get(table_id, timeout_ms=4000)
    # On timeout, get() returns a placeholder rather than raising.
    if getattr(table, '__name__', None) == 'ObjectNotAvailable':
        raise ValueError('Failed to get plasma object')
except (KeyError, ValueError):
    table = pq.read_table(parquet_path, use_threads=True)
    table_id = plasma_client.put(table)
    plasma_objects[parquet_path] = table_id
    # json.dumps raises the TypeError above: ObjectID is not serializable.
    plasma_objects_file.write_text(json.dumps(plasma_objects))
df = table.to_pandas()
{code}
It might help if the object-ID had some API enhancements for serializing and
deserializing it. Exploring the current object shows that `binary()` returns
the raw bytes and that `plasma.ObjectID(b)` rebuilds the same ID from them:
{code:python}
ipdb> plasma_objects
{'./data/dataset_10.parquet':
ObjectID(0ddf993d6b8e2914d9a0ae9a0b4b0eced7397549)}
ipdb> table_id
ObjectID(0ddf993d6b8e2914d9a0ae9a0b4b0eced7397549)
ipdb> str(table_id)
'ObjectID(0ddf993d6b8e2914d9a0ae9a0b4b0eced7397549)'
ipdb> dir(table_id)
['__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__',
'__ge__', '__getattribute__', '__gt__', '__hash__', '__init__',
'__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__',
'__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__',
'__subclasshook__', 'binary', 'from_random']
ipdb> table_id.binary
<built-in method binary of pyarrow._plasma.ObjectID object at 0x7f1974cbca58>
ipdb> table_id.binary()
b'\r\xdf\x99=k\x8e)\x14\xd9\xa0\xae\x9a\x0bK\x0e\xce\xd79uI'
ipdb> table_id.from_random()
ObjectID(0434422c7cd08e3c9fafc2ffd3f7234784ec0d49)
ipdb> table_id
ObjectID(0ddf993d6b8e2914d9a0ae9a0b4b0eced7397549)
ipdb> type(table_id.binary())
<class 'bytes'>
ipdb> bytes(tabled_id)
*** NameError: name 'tabled_id' is not defined
ipdb> bytes(table_id)
*** TypeError: cannot convert 'pyarrow._plasma.ObjectID' object to bytes
ipdb> interact
In : b = table_id.binary()
In : b
b'\r\xdf\x99=k\x8e)\x14\xd9\xa0\xae\x9a\x0bK\x0e\xce\xd79uI'
In : plasma.ObjectID
<class 'pyarrow._plasma.ObjectID'>
In : plasma.ObjectID(b)
ObjectID(0ddf993d6b8e2914d9a0ae9a0b4b0eced7397549)
{code}
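Based on the session above, one possible workaround is to store the bytes from `binary()` as hex text in the JSON file and rebuild the ID with `plasma.ObjectID(...)` after loading. The sketch below works on plain bytes so that it runs without plasma; `ids_to_json` and `ids_from_json` are hypothetical helper names.

```python
import json


def ids_to_json(raw_ids):
    """Encode {path: 20-byte ID} as JSON, storing each ID as a hex string."""
    return json.dumps({path: raw.hex() for path, raw in raw_ids.items()})


def ids_from_json(text):
    """Decode the JSON written by ids_to_json back to {path: bytes}."""
    return {path: bytes.fromhex(h) for path, h in json.loads(text).items()}


# Bytes returned by table_id.binary() in the session above.
raw = b'\r\xdf\x99=k\x8e)\x14\xd9\xa0\xae\x9a\x0bK\x0e\xce\xd79uI'
text = ids_to_json({'./data/dataset_10.parquet': raw})
restored = ids_from_json(text)
# With plasma available, plasma.ObjectID(restored[path]) rebuilds the ID.
```

The hex string is the same text that appears inside `ObjectID(...)` in the repr, so the JSON file stays readable.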
> plasma API requests
> -------------------
>
> Key: ARROW-5324
> URL: https://issues.apache.org/jira/browse/ARROW-5324
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Darren Weber
> Priority: Minor
>
> Copied from [https://github.com/apache/arrow/issues/4318] (it's easier to
> read there; sorry, I hate Jira formatting)
> Related to https://issues.apache.org/jira/browse/ARROW-3444
> While working with the plasma API to create/seal an object for a table, using
> a custom object-ID, it would help to have a convenience API to get the size
> of the table.
> The following code might help to illustrate the request and notes below:
> {code:python}
> if not parquet_path:
>     parquet_path = f"./data/dataset_{size}.parquet"
> if not plasma_path:
>     plasma_path = f"./data/dataset_{size}.plasma"
> try:
>     plasma_client = plasma.connect(plasma_path)
> except:
>     plasma_client = None
> if plasma_client:
>     table_id = plasma.ObjectID(bytes(parquet_path[:20], encoding='utf8'))
>     try:
>         table = plasma_client.get(table_id, timeout_ms=4000)
>         if table.__name__ == 'ObjectNotAvailable':
>             raise ValueError('Failed to get plasma object')
>     except ValueError:
>         table = pq.read_table(parquet_path, use_threads=True)
>         plasma_client.create_and_seal(table_id, table)
> {code}
>
> The use case is a workflow something like this:
> - process-A
> ** generate a pandas DataFrame `df`
> ** save the `df` to parquet, using pyarrow.parquet, with a unique parquet
> path
> ** (this process will not save directly to plasma)
> - process-B
> ** get the data from plasma or load it into plasma from the parquet file
> ** use the unique parquet path to generate a unique object-ID
> Notes:
> - `plasma_client.put` for the same data-table is not idempotent: it
> generates unique object-ID values that are not based on any hash of the data
> payload, so every put saves a new object-ID. Could it use a data hash for
> idempotent puts? e.g.
> {code:python}
> In : plasma_client.put(table)
> ObjectID(666625fcb60959d23b6bfc739f88816da29e04d6)
> In : plasma_client.put(table)
> ObjectID(d2a4662999db30177b090f9fc2bf6b28687d2f8d)
> In : plasma_client.put(table)
> ObjectID(b2928ad786de2fdb74d374055597f6e7bd97fd61)
> In : hash(table)
> TypeError: unhashable type: 'pyarrow.lib.Table'
> {code}
> - In process-B, when the data is not already in plasma, it reads the data
> from a parquet file into a pyarrow.Table. It then needs an object-ID and the
> table size to call plasma `client.create_and_seal`, but it's not easy to get
> the table size. This might be related to github issue #2707 (#3444). It
> might be ideal if `client.create_and_seal` accepted responsibility for the
> size of the object to be created when given a pyarrow data object like a
> table.
> - When the plasma store does not have the object, `get` could apply a default
> timeout rather than hang indefinitely. It's also a bit clumsy to return an
> object that is not easily checked with `isinstance`; an exception handling
> pattern (or something like the requests 404 patterns and options?) could be
> better.
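For the workflow in the quoted description, where process-B derives the object-ID from the parquet path, `parquet_path[:20]` only works when the path has at least 20 characters and no other path shares that prefix. A minimal sketch of an alternative, assuming a SHA-1 digest of the full path is acceptable as an ID (this is a suggestion, not something the issue proposes; `object_id_bytes` is a hypothetical helper):

```python
import hashlib


def object_id_bytes(parquet_path):
    """Return a deterministic 20-byte ID derived from the full parquet path."""
    # A SHA-1 digest is exactly 20 bytes, the length of the IDs shown above.
    return hashlib.sha1(parquet_path.encode("utf8")).digest()


# Two paths that share their first 20 characters.
path_a = "./data/run_a/dataset_10.parquet"
path_b = "./data/run_a/dataset_20.parquet"
id_a = object_id_bytes(path_a)
id_b = object_id_bytes(path_b)
# With plasma available, plasma.ObjectID(id_a) would build the object-ID.
```

The same path always maps to the same ID, so process-B can look the table up without any shared bookkeeping file, while the two example paths above no longer collide.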
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)