[ 
https://issues.apache.org/jira/browse/AVRO-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023584#comment-17023584
 ] 

Brian Lachniet commented on AVRO-2098:
--------------------------------------

Resolved by AVRO-2618

> Avro OCF support for non-seekable stream.
> -----------------------------------------
>
>                 Key: AVRO-2098
>                 URL: https://issues.apache.org/jira/browse/AVRO-2098
>             Project: Apache Avro
>          Issue Type: New Feature
>          Components: csharp
>    Affects Versions: 1.8.2
>         Environment: csharp
> Azure Data Lake Analytics
>            Reporter: Matthew Stowe
>            Priority: Minor
>
> The Microsoft Azure environment supports saving Apache Avro files from an 
> Event Hub via a feature called Event Hub Capture.  The Event Hub Capture 
> feature can be configured to write to Azure Data Lake Storage (ADLS).
> When saving files to ADLS, it is common to use Azure Data Lake Analytics 
> (ADLA) to run batch processing jobs in U-SQL over the raw storage files.  
> For this purpose, ADLA supports extractors that understand the format of the 
> file (e.g. Avro OCF) and extract the file contents for downstream manipulation 
> and filtering.
> An issue I have encountered with the existing csharp implementation is that 
> the DataFileReader relies on the provided stream to support seeking.  
> However, the stream provided by ADLA does not support seeking.  This leaves 
> the integrating developer with two options:
> # Read the entire stream into memory and provide a memory-backed 
> stream to the DataFileReader.  This is not ideal, as files can be large and 
> consuming a lot of memory at once during processing may have undesired 
> effects on ADLA's ability to process files in parallel, since resources are 
> limited.
> # Enhance the DataFileReader to work with streams that are 
> not seekable.  For this option I have implemented a short-term 
> workaround that wraps a non-seekable stream and allows seeking in the 
> pattern employed by the DataFileReader, until this feature has been reviewed 
> and potentially implemented.  My workaround is brittle, liable to 
> break as the DataFileReader evolves, and not the desired long-term 
> approach to dealing with this issue.
> [AvroDataFileReaderStream.cs|https://github.com/thebothead/apache-avro-adla/blob/master/src/Avro.IO.ADLA/AvroDataFileReaderStream.cs]
> Also note that I have submitted a comment to the Avro support forum for ADL 
> regarding their lack of support for seekable streams.
> [How can we improve Microsoft Azure Data 
> Lake?|https://feedback.azure.com/forums/327234-data-lake/suggestions/16616362-support-avro-in-azure-data-lake-analytics]
> [Add support for seekable streams in Azure Data Lake 
> Analytics.|https://feedback.azure.com/forums/327234-data-lake/suggestions/31959457-add-support-for-seekable-streams-in-azure-data-lak]
> Cheers,
> Matt
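
For readers unfamiliar with the wrapping approach described in the second option, here is a minimal sketch of one way to emulate seeking on a non-seekable stream: track the number of bytes consumed and satisfy forward seeks by reading and discarding. It is written in Python only to keep it short; it is neither the linked AvroDataFileReaderStream.cs nor the change made in AVRO-2618. The class name ForwardSeekReader is hypothetical, and the sketch assumes the reader only ever seeks forward, which may not match the DataFileReader's actual access pattern. A C# version would subclass System.IO.Stream instead.

```python
import io


class ForwardSeekReader:
    """Hypothetical wrapper that emulates forward-only seeking.

    The wrapped object only needs a read(size) method; its own seek()
    is never called.
    """

    def __init__(self, raw):
        self._raw = raw   # underlying non-seekable stream
        self._pos = 0     # number of bytes consumed so far

    def read(self, size=-1):
        data = self._raw.read(size)
        self._pos += len(data)
        return data

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        else:
            # Seeking from the end would require knowing the stream length.
            raise io.UnsupportedOperation("seek from end is not supported")
        if target < self._pos:
            # Assumption: bytes already consumed are not buffered.
            raise io.UnsupportedOperation("backward seek is not supported")
        # Skip forward by reading and discarding in bounded chunks,
        # so memory use stays small regardless of the distance skipped.
        remaining = target - self._pos
        while remaining > 0:
            chunk = self._raw.read(min(remaining, 64 * 1024))
            if not chunk:
                break  # end of stream reached before the target
            remaining -= len(chunk)
            self._pos += len(chunk)
        return self._pos
```

The trade-off is that memory use stays bounded, unlike the first option, but any backward seek fails. That is the sense in which a wrapper like this is tied to the reader's seek pattern and can break as the reader evolves.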



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
