[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Lee updated ARROW-4032:
-----------------------------
Description:
Here's a proposal to create a pyarrow.Table.from_pydict() function.
Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems
using Pandas with NULL support for Int(s) and Boolean(s).
See the section "{{NaN}}, Integer {{NA}} values and {{NA}} type promotions" at
[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html].
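As a rough illustration of the promotion problem, here is a minimal sketch. It assumes pandas' default dtype inference for plain Python lists; the names ages and flags are made up for this example.

```python
import pandas as pd

# An integer column with one missing value is promoted to float64,
# because the missing value is stored as NaN, which is a float.
ages = pd.Series([10, None, 7])

# A boolean column with a missing value falls back to the generic object dtype.
flags = pd.Series([True, None, False])

print(ages.dtype)
print(flags.dtype)
```

Neither column keeps its original integer or boolean type, which is the round-trip problem this proposal tries to avoid.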
Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

# Truncate microseconds to milliseconds. More support for ms in Parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute,
                 today.second, today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pylist(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    # A schema, if given, determines the column names.
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        # Keys missing from a row become nulls.
        arrow_columns.append(pa.array([v[column] if column in v else None
                                       for v in pylist], safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

# 'dummy' appears in no row, so it becomes an all-null column.
test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday',
                                       'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])
test2 = from_pylist(test_list, schema=test_schema)
{code}
was:
Here's a proposal to create a pyarrow.Table.from_pydict() function.
Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems
using Pandas with NULL support for Int(s) and Boolean(s).
See the section "{{NaN}}, Integer {{NA}} values and {{NA}} type promotions" at
[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html].
Sample Python code showing how this would work:
{code:java}
import pyarrow as pa
from datetime import datetime

# Truncate microseconds to milliseconds. More support for ms in Parquet.
today = datetime.now()
today = datetime(today.year, today.month, today.day, today.hour, today.minute,
                 today.second, today.microsecond - today.microsecond % 1000)

test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": today}
]

def from_pydict(pylist, schema=None, columns=None, safe=True):
    arrow_columns = list()
    # A schema, if given, determines the column names.
    if schema:
        columns = schema.names
    if not columns:
        return
    for column in columns:
        # Keys missing from a row become nulls.
        arrow_columns.append(pa.array([v[column] if column in v else None
                                       for v in pylist], safe=safe))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    if schema:
        arrow_table = arrow_table.cast(schema, safe=safe)
    return arrow_table

# 'dummy' appears in no row, so it becomes an all-null column.
test = from_pydict(test_list, columns=['name', 'age', 'city', 'birthday',
                                       'dummy'])

test_schema = pa.schema([
    pa.field('name', pa.string()),
    pa.field('age', pa.int16()),
    pa.field('city', pa.string()),
    pa.field('birthday', pa.timestamp('ms'))
])
test2 = from_pydict(test_list, schema=test_schema)
{code}
> [Python] New pyarrow.Table.from_pydict() function
> -------------------------------------------------
>
> Key: ARROW-4032
> URL: https://issues.apache.org/jira/browse/ARROW-4032
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: David Lee
> Priority: Minor