[ https://issues.apache.org/jira/browse/ARROW-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374292#comment-16374292 ]
ASF GitHub Bot commented on ARROW-2066:
---------------------------------------

xhochy closed pull request #1544: ARROW-2066: [Python] Document using pyarrow with Azure Blob Store
URL: https://github.com/apache/arrow/pull/1544

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index d466ba128..b394f562a 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -237,3 +237,44 @@ throughput:
 
    pq.read_table(where, nthreads=4)
    pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+----------------------------------------------
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies:
+
+* python 3.6.2
+* azure-storage 0.36.0
+* pyarrow 0.8.0
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   from io import BytesIO
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   byte_stream = io.BytesIO()
+   block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key)
+   try:
+      block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream)
+      df = pq.read_table(source=byte_stream).to_pandas()
+      # Do work on df ...
+   finally:
+      # Add finally block to ensure closure of the stream
+      byte_stream.close()
+
+Notes:
+
+* The ``account_key`` can be found under ``Settings -> Access keys`` in the Microsoft Azure portal for a given container
+* The code above works for a container with private access, Lease State = Available, Lease Status = Unlocked
+* The parquet file was Blob Type = Block blob

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [Python] Document reading Parquet files from Azure Blob Store
> -------------------------------------------------------------
>
>                 Key: ARROW-2066
>                 URL: https://issues.apache.org/jira/browse/ARROW-2066
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Uwe L. Korn
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>
> See https://github.com/apache/arrow/issues/1510

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
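
A note on the snippet in the diff: it imports `BytesIO` by name but then calls `io.BytesIO()`, which would raise a `NameError` as written, because the `io` module itself is never imported. Below is a minimal sketch of the same download-then-read pattern with that fixed. The helper name `read_parquet_blob` and its `download_to_stream` and `read_table` parameters are hypothetical, introduced only for illustration. `read_table` defaults to `pyarrow.parquet.read_table`, imported lazily. The `seek(0)` rewind is a defensive assumption and is not part of the merged docs, and the sketch assumes the returned table does not need the stream to stay open.

```python
from io import BytesIO


def read_parquet_blob(download_to_stream, read_table=None):
    """Download a blob into memory and parse it as Parquet.

    download_to_stream: hypothetical callable that writes the blob's
        bytes into the file-like object it is given.
    read_table: callable that parses a file-like object; defaults to
        pyarrow.parquet.read_table (imported lazily).
    """
    if read_table is None:
        import pyarrow.parquet as pq  # only needed for the default reader
        read_table = pq.read_table

    # The context manager closes the stream even if the download or the
    # parse raises, replacing the explicit try/finally in the diff.
    with BytesIO() as byte_stream:
        download_to_stream(byte_stream)
        # Assumption: rewind so the reader starts at the first byte.
        byte_stream.seek(0)
        return read_table(byte_stream)
```

With the `block_blob_service`, `container_name`, and `parquet_file` names from the diff, a call might look roughly like `read_parquet_blob(lambda s: block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=s)).to_pandas()`.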