[ https://issues.apache.org/jira/browse/ARROW-5324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Darren Weber updated ARROW-5324:
--------------------------------
    Description: 
Copied from [https://github.com/apache/arrow/issues/4318] (it's easier to read 
there; sorry, I hate Jira formatting)

Related to https://issues.apache.org/jira/browse/ARROW-3444 

While working with the plasma API to create/seal an object for a table using a 
custom object-ID, it would help to have a convenience API to get the size of 
the table.

The following code might help to illustrate the request and notes below:
{code:python}
    # assumed imports: import pyarrow.parquet as pq; from pyarrow import plasma
    # `size`, `parquet_path`, and `plasma_path` come from the enclosing function
    if not parquet_path:
        parquet_path = f"./data/dataset_{size}.parquet"

    if not plasma_path:
        plasma_path = f"./data/dataset_{size}.plasma"

    try:
        plasma_client = plasma.connect(plasma_path)
    except Exception:  # narrowed from a bare except; treat any failure as "no store"
        plasma_client = None

    if plasma_client:
        # derive a deterministic 20-byte object-ID from the parquet path
        table_id = plasma.ObjectID(bytes(parquet_path[:20], encoding='utf8'))
        try:
            table = plasma_client.get(table_id, timeout_ms=4000)
            # plasma returns the ObjectNotAvailable sentinel when the get times out
            if table is plasma.ObjectNotAvailable:
                raise ValueError('Failed to get plasma object')
        except ValueError:
            table = pq.read_table(parquet_path, use_threads=True)
            # desired behaviour: create_and_seal would compute the object size
            # itself when handed a pyarrow Table
            plasma_client.create_and_seal(table_id, table)
{code}
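
For context, the manual pattern that currently seems to be needed to size and 
write a table into plasma (based on the pyarrow plasma documentation) looks 
roughly like the sketch below; the convenience requested here would hide the 
extra MockOutputStream sizing pass. `write_table_to_plasma` is a hypothetical 
helper name, and `client`/`object_id` are assumed to already exist:
{code:python}
import pyarrow as pa


def write_table_to_plasma(client, object_id, table):
    # first pass: measure the serialized size with a MockOutputStream
    mock_sink = pa.MockOutputStream()
    writer = pa.RecordBatchStreamWriter(mock_sink, table.schema)
    writer.write_table(table)
    writer.close()
    data_size = mock_sink.size()

    # second pass: create a plasma buffer of that size and serialize into it
    buf = client.create(object_id, data_size)
    stream = pa.FixedSizeBufferWriter(buf)
    writer = pa.RecordBatchStreamWriter(stream, table.schema)
    writer.write_table(table)
    writer.close()
    client.seal(object_id)
{code}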
 

The use case is a workflow something like this (a minimal sketch of process-A 
follows the list):
 - process-A
 ** generate a pandas DataFrame `df`
 ** save the `df` to parquet, using pyarrow.parquet, with a unique parquet path
 ** (this process will not save directly to plasma)
 - process-B
 ** get the data from plasma or load it into plasma from the parquet file
 ** use the unique parquet path to generate a unique object-ID
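
A minimal sketch of process-A under the assumptions above (the DataFrame 
contents and the parquet path are placeholders):
{code:python}
# process-A: build a DataFrame and persist it to a unique parquet path
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"value": range(1000)})     # hypothetical payload
parquet_path = "./data/dataset_1000.parquet"  # unique path shared with process-B
pq.write_table(pa.Table.from_pandas(df), parquet_path)
{code}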

Notes:
 - `plasma_client.put` for the same data-table is not idempotent: it generates 
unique object-ID values that are not based on any hash of the data payload, so 
every put saves a new object-ID (a possible caller-side workaround is sketched 
after these notes). Could it use a data hash for idempotent puts? For example:
{code:python}
In : plasma_client.put(table)
ObjectID(666625fcb60959d23b6bfc739f88816da29e04d6)
In : plasma_client.put(table)
ObjectID(d2a4662999db30177b090f9fc2bf6b28687d2f8d)
In : plasma_client.put(table)
ObjectID(b2928ad786de2fdb74d374055597f6e7bd97fd61)

In : hash(table)
TypeError: unhashable type: 'pyarrow.lib.Table'
{code}

 - In process-B, when the data is not already in plasma, the code reads data 
from a parquet file into a pyarrow.Table and then needs an object-ID and the 
table size to call plasma `client.create_and_seal`, but the table size is not 
easy to obtain (see the sizing sketch above; this might be related to github 
issue #2707 (#3444)). It might be ideal if `client.create_and_seal` accepted 
responsibility for sizing the object when given a pyarrow data object like a 
table.
 - When the plasma store does not have the object, the get could use a default 
timeout rather than hang indefinitely. It is also a bit clumsy that the call 
returns a sentinel that is not easily checked with `isinstance`; an 
exception-based pattern (or something like the requests 404 patterns and 
options) could be better. A small caller-side helper along these lines is 
sketched below.
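
As a possible caller-side workaround for the idempotent-put note above, the 
20-byte object-ID can be derived from a hash of the serialized table instead of 
letting `put` pick a random ID. This is only a sketch; it assumes the table 
serializes identically on each run, and `content_object_id` is a hypothetical 
helper name:
{code:python}
import hashlib

import pyarrow as pa
from pyarrow import plasma


def content_object_id(table: pa.Table) -> plasma.ObjectID:
    # serialize the table to the Arrow IPC stream format and hash the bytes
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, table.schema)
    writer.write_table(table)
    writer.close()
    # a SHA-1 digest is exactly the 20 bytes an ObjectID requires
    return plasma.ObjectID(hashlib.sha1(sink.getvalue().to_pybytes()).digest())
{code}

For the last note, a small helper sketch that turns the ObjectNotAvailable 
sentinel into an exception with a caller-chosen default timeout 
(`plasma_get_or_raise` is likewise a hypothetical name):
{code:python}
from pyarrow import plasma


def plasma_get_or_raise(client, object_id, timeout_ms=4000):
    # `client.get` returns the ObjectNotAvailable sentinel when the object is
    # not in the store within the timeout; surface that as an exception instead
    obj = client.get(object_id, timeout_ms=timeout_ms)
    if obj is plasma.ObjectNotAvailable:
        raise KeyError(f"plasma object {object_id} not available "
                       f"within {timeout_ms} ms")
    return obj
{code}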



> plasma API requests
> -------------------
>
>                 Key: ARROW-5324
>                 URL: https://issues.apache.org/jira/browse/ARROW-5324
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Darren Weber
>            Priority: Minor
>



