[jira] [Commented] (ARROW-1675) [Python] Use RecordBatch.from_pandas in FeatherWriter.write

ASF GitHub Bot (JIRA) Wed, 25 Oct 2017 19:25:51 -0700

    [ https://issues.apache.org/jira/browse/ARROW-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219863#comment-16219863 ]


ASF GitHub Bot commented on ARROW-1675:
---------------------------------------

wesm closed pull request #1250: ARROW-1675: [Python] Use RecordBatch.from_pandas in Feather write path
URL: https://github.com/apache/arrow/pull/1250
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py
index 2091c9154..3ba9d652c 100644
--- a/python/pyarrow/feather.py
+++ b/python/pyarrow/feather.py
@@ -23,7 +23,7 @@
 
 from pyarrow.compat import pdapi
 from pyarrow.lib import FeatherError  # noqa
-from pyarrow.lib import Table
+from pyarrow.lib import RecordBatch, Table
 import pyarrow.lib as ext
 
 try:
@@ -75,30 +75,12 @@ def write(self, df):
         if not df.columns.is_unique:
             raise ValueError("cannot serialize duplicate column names")
 
-        # TODO(wesm): pipeline conversion to Arrow memory layout
-        for i, name in enumerate(df.columns):
-            col = df.iloc[:, i]
-
-            if pdapi.is_object_dtype(col):
-                inferred_type = infer_dtype(col)
-                msg = ("cannot serialize column {n} "
-                       "named {name} with dtype {dtype}".format(
-                           n=i, name=name, dtype=inferred_type))
-
-                if inferred_type in ['mixed']:
-
-                    # allow columns with nulls + an inferable type
-                    inferred_type = infer_dtype(col[col.notnull()])
-                    if inferred_type in ['mixed']:
-                        raise ValueError(msg)
-
-                elif inferred_type not in ['unicode', 'string']:
-                    raise ValueError(msg)
-
-            if not isinstance(name, six.string_types):
-                name = str(name)
-
-            self.writer.write_array(name, col)
+        # TODO(wesm): Remove this length check, see ARROW-1732
+        if len(df.columns) > 0:
+            batch = RecordBatch.from_pandas(df, preserve_index=False)
+            for i, name in enumerate(batch.schema.names):
+                col = batch[i]
+                self.writer.write_array(name, col)
 
         self.writer.close()
 
diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py
index 810ee3c8c..9e7fc8863 100644
--- a/python/pyarrow/tests/test_feather.py
+++ b/python/pyarrow/tests/test_feather.py
@@ -279,11 +279,14 @@ def test_delete_partial_file_on_error(self):
         if sys.platform == 'win32':
             pytest.skip('Windows hangs on to file handle for some reason')
 
+        class CustomClass(object):
+            pass
+
         # strings will fail
         df = pd.DataFrame(
             {
                 'numbers': range(5),
-                'strings': [b'foo', None, u'bar', 'qux', np.nan]},
+                'strings': [b'foo', None, u'bar', CustomClass(), np.nan]},
             columns=['numbers', 'strings'])
 
         path = random_path()
@@ -297,10 +300,13 @@ def test_delete_partial_file_on_error(self):
     def test_strings(self):
         repeats = 1000
 
-        # we hvae mixed bytes, unicode, strings
+        # Mixed bytes, unicode, strings coerced to binary
         values = [b'foo', None, u'bar', 'qux', np.nan]
         df = pd.DataFrame({'strings': values * repeats})
-        self._assert_error_on_write(df, ValueError)
+
+        ex_values = [b'foo', None, b'bar', b'qux', np.nan]
+        expected = pd.DataFrame({'strings': ex_values * repeats})
+        self._check_pandas_roundtrip(df, expected, null_counts=[2 * repeats])
 
         # embedded nulls are ok
         values = ['foo', None, 'bar', 'qux', None]
diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi
index 686e56ead..c9a490960 100644
--- a/python/pyarrow/types.pxi
+++ b/python/pyarrow/types.pxi
@@ -662,7 +662,6 @@ cdef _as_type(type):
     return type_for_alias(type)
 
 
-
 cdef set PRIMITIVE_TYPES = set([
     _Type_NA, _Type_BOOL,
     _Type_UINT8, _Type_INT8,


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> [Python] Use RecordBatch.from_pandas in FeatherWriter.write
> -----------------------------------------------------------
>
>                 Key: ARROW-1675
>                 URL: https://issues.apache.org/jira/browse/ARROW-1675
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>              Labels: pull-request-available
>             Fix For: 0.8.0
>
>
> In addition to simplifying the implementation, this lets the Feather write
> path benefit from multithreaded conversions, which should give faster writes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
