[ 
https://issues.apache.org/jira/browse/AVRO-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Stowe updated AVRO-2098:
--------------------------------
    Description: 
The Microsoft Azure environment supports saving Apache Avro files from an Event 
Hub via a feature called Event Hub Capture.  The Event Hub Capture feature can 
be configured to write to Azure Data Lake Storage (ADLS).

When saving files to ADLS, it is common to use Azure Data Lake Analytics (ADLA) 
to run batch processing jobs in U-SQL over the raw storage files.  For this, 
ADLA supports extractors that can deal with the format of the file (e.g. 
Avro OCF) and extract file contents for downstream manipulation and filtering.

An issue I have encountered with the existing csharp implementation is that the 
DataFileReader relies on the provided stream to support seeking.  However, the 
stream provided by ADLA does not support seeking.  This leaves the integrating 
developer with two options:

# Read the entire stream into memory and provide a memory-backed stream to the 
DataFileReader.  This is not ideal, as files can be large, and consuming a lot 
of memory at once during processing may have undesired effects on ADLA's 
ability to process files in parallel, since resources are limited.
# Enhance the DataFileReader to work with streams that are not seekable.  For 
this option I have implemented a short-term workaround that wraps a 
non-seekable stream and allows seeking in the pattern employed by the 
DataFileReader, until this feature has been reviewed and potentially 
implemented.  My workaround is brittle, likely to break as the DataFileReader 
evolves, and is not the desired long-term approach to this issue.
[AvroDataFileReaderStream.cs|https://github.com/thebothead/apache-avro-adla/blob/master/src/Avro.IO.ADLA/AvroDataFileReaderStream.cs]
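
For illustration only, the two options can be sketched in C# as below.  This 
is not the linked AvroDataFileReaderStream.cs and not part of Apache Avro.  
The names StreamOptions, BufferFully, PrefixRewindStream and prefixCapacity 
are hypothetical, and the sketch assumes the reader only ever seeks back into 
the first few kilobytes of the file (for example, to re-read the header) and 
otherwise reads forward.  Whether that matches the DataFileReader's actual 
seek pattern is an assumption that would need checking against its source.

```csharp
using System;
using System.IO;

public static class StreamOptions
{
    // Option 1: copy the whole source into memory. Simple, but memory use
    // grows with the size of the file.
    public static MemoryStream BufferFully(Stream source)
    {
        var buffered = new MemoryStream();
        source.CopyTo(buffered);
        buffered.Position = 0;
        return buffered;
    }
}

// Option 2 (hypothetical helper): reports CanSeek = true, but only honours
// seeks that stay inside a bounded prefix of the forward-only source.
public sealed class PrefixRewindStream : Stream
{
    private readonly Stream _inner;   // forward-only source
    private readonly byte[] _prefix;  // copy of the first bytes read from _inner
    private long _consumed;           // bytes read from _inner so far
    private long _position;           // logical position seen by the caller

    public PrefixRewindStream(Stream inner, int prefixCapacity = 64 * 1024)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
        _prefix = new byte[prefixCapacity];
    }

    public override bool CanRead => true;
    public override bool CanSeek => true;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();

    public override long Position
    {
        get => _position;
        set => Seek(value, SeekOrigin.Begin);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        int read;
        if (_position < _consumed)
        {
            // After a rewind: replay bytes already taken from the source.
            read = (int)Math.Min(count, _consumed - _position);
            Buffer.BlockCopy(_prefix, (int)_position, buffer, offset, read);
        }
        else
        {
            read = _inner.Read(buffer, offset, count);
            if (_consumed < _prefix.Length)
            {
                // Keep a copy while the prefix buffer still has room.
                int keep = (int)Math.Min(read, _prefix.Length - _consumed);
                Buffer.BlockCopy(buffer, offset, _prefix, (int)_consumed, keep);
            }
            _consumed += read;
        }
        _position += read;
        return read;
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        long target;
        switch (origin)
        {
            case SeekOrigin.Begin: target = offset; break;
            case SeekOrigin.Current: target = _position + offset; break;
            default: throw new NotSupportedException("Length is unknown.");
        }
        // Rewinds are possible only while every consumed byte is still buffered.
        bool rewindable = _consumed <= _prefix.Length;
        if (target < 0 || target > _consumed || (target < _consumed && !rewindable))
            throw new NotSupportedException("Seek target is outside the buffered prefix.");
        _position = target;
        return _position;
    }

    public override void Flush() { }
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) =>
        throw new NotSupportedException();
}
```

With option 2, memory use is bounded by prefixCapacity rather than by the file 
size, at the cost of throwing NotSupportedException as soon as a seek falls 
outside the buffered prefix.  That dependence on the reader's seek pattern is 
what makes this kind of wrapper brittle.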

Cheers,
Matt

  was:
The Microsoft Azure environment supports saving Apache Avro files from an Event 
Hub via a feature called Event Hub Capture.  The Event Hub Capture feature can 
be configured to write to Azure Data Lake Storage (ADLS).

When saving files to ADLS it is common to use Azure Data Lake Analytics (ADLA) 
to run batch processing jobs in U-SQL over the raw storage files.  When doing 
this ADLA supports extractors that can deal with the format of the file (e.g. 
Avro OCF) and extract file contents for downstream manipulation and filtering.

An issue I have encountered with the existing csharp implementation is that the 
DataFileReader relies on the provided stream to support seeking.  However, the 
stream provided by ADLA does not support seeking.  This leaves the integrating 
developer with 2 options...

1 is to read the entire stream into memory and provide a memory-backed stream 
to the DataFileReader.  This is not ideal as files can be large and consuming a 
lot of memory at once during processing may have undesired effects on ADLA's 
ability to process files in parallel, as resources are of course limited.

2 is to enhance the DataFileReader to be able to work with streams that are not 
seekable.  With respect to this option I have implemented a short-term 
workaround that can wrap a non-seekable stream and allow seeking in the 
pattern employed by the DataFileReader until this feature has been reviewed and 
potentially implemented.  My workaround is brittle and subject to breaking as 
the DataFileReader evolves and is not the desired long-term approach to dealing 
with this issue.

[AvroDataFileReaderStream.cs|https://github.com/thebothead/apache-avro-adla/blob/master/src/Avro.IO.ADLA/AvroDataFileReaderStream.cs]

Cheers,
Matt


> Avro OCF support for non-seekable stream.
> -----------------------------------------
>
>                 Key: AVRO-2098
>                 URL: https://issues.apache.org/jira/browse/AVRO-2098
>             Project: Avro
>          Issue Type: New Feature
>          Components: csharp
>    Affects Versions: 1.8.2
>         Environment: csharp
> Azure Data Lake Analytics
>            Reporter: Matthew Stowe
>            Priority: Minor
>
> The Microsoft Azure environment supports saving Apache Avro files from an 
> Event Hub via a feature called Event Hub Capture.  The Event Hub Capture 
> feature can be configured to write to Azure Data Lake Storage (ADLS).
> When saving files to ADLS, it is common to use Azure Data Lake Analytics 
> (ADLA) to run batch processing jobs in U-SQL over the raw storage files.  
> For this, ADLA supports extractors that can deal with the format of the 
> file (e.g. Avro OCF) and extract file contents for downstream manipulation 
> and filtering.
> An issue I have encountered with the existing csharp implementation is that 
> the DataFileReader relies on the provided stream to support seeking.  
> However, the stream provided by ADLA does not support seeking.  This leaves 
> the integrating developer with two options:
> # Read the entire stream into memory and provide a memory-backed stream to 
> the DataFileReader.  This is not ideal, as files can be large, and 
> consuming a lot of memory at once during processing may have undesired 
> effects on ADLA's ability to process files in parallel, since resources are 
> limited.
> # Enhance the DataFileReader to work with streams that are not seekable.  
> For this option I have implemented a short-term workaround that wraps a 
> non-seekable stream and allows seeking in the pattern employed by the 
> DataFileReader, until this feature has been reviewed and potentially 
> implemented.  My workaround is brittle, likely to break as the 
> DataFileReader evolves, and is not the desired long-term approach to this 
> issue.
> [AvroDataFileReaderStream.cs|https://github.com/thebothead/apache-avro-adla/blob/master/src/Avro.IO.ADLA/AvroDataFileReaderStream.cs]
> Cheers,
> Matt



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
