[jira] [Commented] (ARROW-2066) [Python] Document reading Parquet files from Azure Blob Store

ASF GitHub Bot (JIRA) Fri, 23 Feb 2018 04:39:13 -0800

    [ https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374292#comment-16374292 ]


ASF GitHub Bot commented on ARROW-2066:
---------------------------------------

xhochy closed pull request #1544: ARROW-2066: [Python] Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index d466ba128..b394f562a 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -237,3 +237,44 @@ throughput:
 
    pq.read_table(where, nthreads=4)
    pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+----------------------------------------------
+
+The code below shows how to use the Azure Storage SDK together with pyarrow to
+read a Parquet file into a pandas DataFrame.
+It is suitable for running inside a Jupyter notebook on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   from io import BytesIO
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   byte_stream = BytesIO()
+   block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key)
+   try:
+      block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream)
+      df = pq.read_table(source=byte_stream).to_pandas()
+      # Do work on df ...
+   finally:
+      # Add finally block to ensure closure of the stream
+      byte_stream.close()
+
+Notes:
+
+* The ``account_key`` can be found under ``Settings -> Access keys`` in the Microsoft Azure portal for the storage account that holds the container
+* The code above works for a container with private access, Lease State = Available, Lease Status = Unlocked
+* The Parquet file was stored with Blob Type = Block blob
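
The stream handling in the snippet above can be tried without an Azure account. The following is a minimal sketch using only the standard library; ``fake_download_to_stream`` is a hypothetical stand-in for a blob download call and is not part of azure-storage or pyarrow. It assumes the download simply writes bytes into the stream, and it uses ``with`` in place of the ``try``/``finally`` block to guarantee the stream is closed.

```python
from io import BytesIO


def fake_download_to_stream(stream, payload):
    """Hypothetical stand-in for a blob download: writes `payload` into `stream`."""
    stream.write(payload)


def read_downloaded_bytes(payload):
    # BytesIO is a context manager, so the stream is closed when the block
    # exits, even if an exception is raised inside it.
    with BytesIO() as byte_stream:
        fake_download_to_stream(byte_stream, payload)
        # Writing leaves the position at the end of the buffer, so rewind
        # before any sequential read from the start.
        byte_stream.seek(0)
        data = byte_stream.read()
    # Return the bytes and whether the stream was closed on exit.
    return data, byte_stream.closed


data, closed = read_downloaded_bytes(b"example bytes")
print(len(data), closed)
```

Whether an explicit ``seek(0)`` is needed before handing the stream to a reader depends on how that reader positions itself, so treat the rewind here as a defensive habit and not as a documented requirement.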


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Document reading Parquet files from Azure Blob Store
> -------------------------------------------------------------
>
>                 Key: ARROW-2066
>                 URL: https://issues.apache.org/jira/browse/ARROW-2066
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Uwe L. Korn
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>
> See https://github.com/apache/arrow/issues/1510



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
