[
https://issues.apache.org/jira/browse/AVRO-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023584#comment-17023584
]
Brian Lachniet commented on AVRO-2098:
--------------------------------------
Resolved by AVRO-2618
> Avro OCF support for non-seekable stream.
> -----------------------------------------
>
> Key: AVRO-2098
> URL: https://issues.apache.org/jira/browse/AVRO-2098
> Project: Apache Avro
> Issue Type: New Feature
> Components: csharp
> Affects Versions: 1.8.2
> Environment: csharp
> Azure Data Lake Analytics
> Reporter: Matthew Stowe
> Priority: Minor
>
> The Microsoft Azure environment supports saving Apache Avro files from an
> Event Hub via a feature called Event Hub Capture. The Event Hub Capture
> feature can be configured to write to Azure Data Lake Storage (ADLS).
> When saving files to ADLS it is common to use Azure Data Lake Analytics
> (ADLA) to run batch processing jobs in U-SQL over the raw storage files.
> When doing this ADLA supports extractors that can deal with the format of the
> file (e.g. Avro OCF) and extract file contents for downstream manipulation
> and filtering.
> An issue I have encountered with the existing csharp implementation is that
> the DataFileReader relies on the provided stream to support seeking.
> However, the stream provided by ADLA does not support seeking. This leaves
> the integrating developer with 2 options...
> # is to read the entire stream into memory and provide a memory-backed
> stream to the DataFileReader. This is not ideal as files can be large and
> consuming a lot of memory at once during processing may have undesired
> effects on ADLA's ability to process files in parallel, as resources are of
> course limited.
> # is to enhance the DataFileReader to be able to work with streams that are
> not seekable. With respect to this option I have implemented a short-term
> workaround that can wrap a non-seekable stream and allow seeking in the
> pattern employed by the DataFileReader until this feature has been reviewed
> and potentially implemented. My workaround is brittle and subject to
> breaking as the DataFileReader evolves and is not the desired long term
> approach to dealing with this issue.
> [AvroDataFileReaderStream.cs|https://github.com/thebothead/apache-avro-adla/blob/master/src/Avro.IO.ADLA/AvroDataFileReaderStream.cs]
> Also note that I have submitted a comment to the Avro support forum for ADL
> regarding their lack of support for seekable streams.
> [How can we improve Microsoft Azure Data
> Lake?|https://feedback.azure.com/forums/327234-data-lake/suggestions/16616362-support-avro-in-azure-data-lake-analytics]
> [Add support for seekable streams in Azure Data Lake
> Analytics.|https://feedback.azure.com/forums/327234-data-lake/suggestions/31959457-add-support-for-seekable-streams-in-azure-data-lak]
> Cheers,
> Matt
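
The wrapping approach described in option 2 of the quoted report can be sketched roughly as follows. This is not the reporter's AvroDataFileReaderStream.cs and not the change made in AVRO-2618; it is a minimal, language-neutral illustration in Python. It assumes the consumer only needs the current position and forward skips, and the class name ForwardOnlySeekStream and its chunk_size parameter are hypothetical.

```python
import io


class ForwardOnlySeekStream:
    """Hypothetical wrapper: tracks position and emulates forward seeks
    on a binary stream that only supports read()."""

    def __init__(self, raw, chunk_size=64 * 1024):
        self._raw = raw                # underlying non-seekable stream
        self._pos = 0                  # bytes consumed so far
        self._chunk_size = chunk_size  # max bytes discarded per read while skipping

    def tell(self):
        return self._pos

    def read(self, size=-1):
        data = self._raw.read(size)
        self._pos += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        else:
            # The total length is unknown without reading the whole stream.
            raise io.UnsupportedOperation("cannot seek relative to the end")
        if target < self._pos:
            raise io.UnsupportedOperation("backward seek on a forward-only stream")
        # Emulate a forward seek by reading and discarding bytes.
        remaining = target - self._pos
        while remaining > 0:
            chunk = self.read(min(remaining, self._chunk_size))
            if not chunk:
                break  # end of stream reached before the target offset
            remaining -= len(chunk)
        return self._pos
```

Because skipped bytes are discarded in bounded chunks, memory use stays constant regardless of file size, which is the advantage over option 1. The trade-off is the brittleness the report mentions: any reader that starts seeking backward or asking for the stream length would break a wrapper like this.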
--
This message was sent by Atlassian Jira
(v8.3.4#803005)