[ https://issues.apache.org/jira/browse/ARROW-4032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721986#comment-16721986 ]
David Lee edited comment on ARROW-4032 at 12/15/18 3:53 AM:
------------------------------------------------------------

Ended up just writing from_pylist() and to_pylist(). They run much faster than going through pandas.

{code:java}
import pyarrow as pa


def from_pylist(pylist, schema, safe=True):
    # Build one Arrow array per schema field; keys missing from a row become null.
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(
            pa.array(
                [v[column] if column in v else None for v in pylist],
                safe=safe,
                type=schema.types[schema.get_field_index(column)],
            )
        )
    arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    return arrow_table


def to_pylist(arrow_table):
    # Convert to a dict of column lists, then pivot back into one dict per row.
    od = arrow_table.to_pydict()
    pylist = list()
    columns = list(od.keys())
    rows = arrow_table.num_rows
    for row in range(rows):
        pylist.append({key: od[key][row] for key in columns})
    return pylist
{code}

was (Author: davlee1...@yahoo.com):

Ended up just writing from_pylist() and to_pylist().. They run much faster than going through pandas..

{code:java}
def from_pylist(pylist, schema, safe=True):
    arrow_columns = list()
    for column in schema.names:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

def to_pylist(arrow_table):
    od = pyarrow.Table.to_pydict(arrow_table)
    pylist = list()
    columns = list(arrow_table.keys())
    rows = len(arrow_table[columns[0]])
    for row in range(rows):
        pylist.append({key: arrow_table[key][row] for key in columns})
    return pylist
{code}

> [Python] New pyarrow.Table.from_pylist() function
> -------------------------------------------------
>
>                 Key: ARROW-4032
>                 URL: https://issues.apache.org/jira/browse/ARROW-4032
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: Python
>            Reporter: David Lee
>            Priority: Minor
>
> Here's a proposal to create a pyarrow.Table.from_pylist() function.
> Right now only pyarrow.Table.from_pandas() exists, and there are inherent
> problems using pandas with NULL support for ints and booleans:
> [http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]
> (see "{{NaN}}, Integer {{NA}} values and {{NA}} type promotions").
>
> Sample Python code showing how this would work:
>
> {code:java}
> import pyarrow as pa
> from datetime import datetime
>
> # Truncate microseconds to millisecond precision; Parquet has better
> # support for millisecond timestamps.
> today = datetime.now()
> today = today.replace(microsecond=today.microsecond - today.microsecond % 1000)
>
> test_list = [
>     {"name": "Tom", "age": 10},
>     {"name": "Mark", "age": 5, "city": "San Francisco"},
>     {"name": "Pam", "age": 7, "birthday": today}
> ]
>
> def from_pylist(pylist, schema=None, columns=None, safe=True):
>     arrow_columns = list()
>     # A schema, when given, determines the column names.
>     if schema:
>         columns = schema.names
>     if not columns:
>         return
>     # Build one array per column; keys missing from a row become null.
>     for column in columns:
>         arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
>     arrow_table = pa.Table.from_arrays(arrow_columns, columns)
>     # Cast the inferred types to the requested schema.
>     if schema:
>         arrow_table = arrow_table.cast(schema, safe=safe)
>     return arrow_table
>
> test = from_pylist(test_list, columns=['name', 'age', 'city', 'birthday', 'dummy'])
>
> test_schema = pa.schema([
>     pa.field('name', pa.string()),
>     pa.field('age', pa.int16()),
>     pa.field('city', pa.string()),
>     pa.field('birthday', pa.timestamp('ms'))
> ])
>
> test2 = from_pylist(test_list, schema=test_schema)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
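
The row-to-column pivot at the heart of from_pylist() and to_pylist() can be sketched in plain Python, which may help when reading the pyarrow versions. This is only an illustration under stated assumptions: rows_to_columns and columns_to_rows are hypothetical helper names, not pyarrow functions, and no Arrow typing, casting, or safety checking is performed.

```python
def rows_to_columns(rows, columns):
    """Pivot a list of dicts into a dict of equal-length lists.

    Keys missing from a row are filled with None, mirroring how the
    from_pylist() proposal fills absent fields with nulls.
    """
    return {name: [row.get(name) for row in rows] for name in columns}


def columns_to_rows(table):
    """Pivot a dict of equal-length lists back into a list of dicts."""
    names = list(table)
    return [dict(zip(names, values)) for values in zip(*table.values())]


# Sample rows adapted from the issue description (birthday omitted).
test_list = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7},
]

table = rows_to_columns(test_list, ["name", "age", "city"])
round_trip = columns_to_rows(table)
```

Note that the round trip is not lossless: a key that was absent from an input row comes back as an explicit None value.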