Hi

Following is another approach for obtaining the file length from S3.

We have an existing class, FileBlockMetadata, which currently contains only
the filePath. We can add a fileLength field to it, which will then get
passed to the module. This approach is a lot cleaner, and no additional
requests to S3 are needed.
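
A rough sketch of the change (the accessor names are my suggestion;
constructors and other existing members are elided):

public class FileBlockMetadata extends AbstractBlockMetadata
{
  private String filePath;  // existing field
  private long fileLength;  // proposed new field, set by the file splitter

  public long getFileLength()
  {
    return fileLength;
  }

  public void setFileLength(long fileLength)
  {
    this.fileLength = fileLength;
  }
}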

Kindly provide your opinion on which approach is best suited.


Regards,
Ajay

On Wed, Oct 19, 2016 at 6:43 PM, AJAY GUPTA <ajaygit...@gmail.com> wrote:

> Hi
>
> I need suggestions from the Apex dev community on the following.
>
> For the S3RecordReader approach mentioned in the previous mail, I am facing
> an issue with determining the end of file.
> Note that the input to this operator will not contain the file size.
>
> The following approaches are possible:
>
> 1) The S3 getObject() call, which fetches file data within a byte range,
> throws an AmazonS3Exception if the requested range is out of bounds. Hence,
> if the file size is 10 bytes and I make a getObject request for bytes 11 to
> 15, I will get this exception:
> Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception:
> The requested range is not satisfiable (Service: Amazon S3; Status Code:
> 416; Error Code: InvalidRange; Request ID:
> If this exception gets thrown, I can catch it in the code and conclude
> that the end of the file has been reached.
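>
> A minimal sketch of this (AWS SDK for Java; imports omitted, and variable
> names such as s3Client and endOfFileReached are illustrative):
>
> GetObjectRequest request = new GetObjectRequest(bucketName, key)
>     .withRange(startByte, endByte);
> try {
>   S3Object object = s3Client.getObject(request);
>   // ... read and emit records from the returned range ...
> } catch (AmazonS3Exception e) {
>   if (e.getStatusCode() == 416) {
>     // InvalidRange: the range starts beyond the last byte of the file,
>     // so the end of the file has been reached.
>     endOfFileReached = true;
>   } else {
>     throw e;
>   }
> }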
>
> 2) For every container running this application, maintain a map<filename,
> filesize>. If the file size already exists in this map, use it from there;
> if not, fetch the file size from S3 and add it to the map.
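>
> A sketch of this cache (names are illustrative; getObjectMetadata() costs
> one extra S3 request per file, made at most once per container):
>
> private final Map<String, Long> fileSizeCache = new HashMap<>();
>
> private long getFileSize(String key)
> {
>   Long size = fileSizeCache.get(key);
>   if (size == null) {
>     size = s3Client.getObjectMetadata(bucketName, key).getContentLength();
>     fileSizeCache.put(key, size);
>   }
>   return size;
> }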
>
> My own opinion is to go with the first approach, since it requires fewer
> calls to S3 to determine the file length.
> Kindly suggest any other approaches you can think of.
>
>
> Thanks,
> Ajay
>
>
>
> On Wed, Oct 19, 2016 at 11:53 AM, AJAY GUPTA <ajaygit...@gmail.com> wrote:
>
>> Hi Apex Dev community,
>>
>> Kindly provide feedback, if any, on the following approach for
>> implementing S3RecordReader.
>>
>> *S3RecordReader (delimited records)*
>> *Input*: BlockMetadata containing offset and length
>> *Expected Output*: Records in the block
>> *Approach*:
>> Similar to the approach currently followed in FSRecordReader.
>> 1) Fetch the block from S3. The S3 block fetch size should ideally be
>> large enough, say 64 MB, to avoid unnecessary network delays.
>> 2) Search for newline characters in the block and emit the records.
>> 3) The last record in the current block might overflow into the subsequent
>> block. For this, we will fetch a small part of the subsequent block, say
>> 1 MB, search it for a newline character, and emit the record once one is
>> found. We will keep fetching additional 1 MB chunks until a newline
>> character is found (see the sketch after this list).
>> 4) We will also skip the bytes before the first newline in every block
>> (except the first block), as they are part of the last record of the
>> previous block.
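>>
>> A rough sketch of step 3 (AWS SDK for Java; imports omitted; s3Client,
>> bucketName, key and OVERFLOW_FETCH_SIZE are assumed fields of the
>> operator):
>>
>> // Read the tail of the record that starts in the current block and
>> // overflows into the next one, fetching 1 MB slices until a newline
>> // terminates it.
>> private byte[] readOverflow(long blockEndOffset) throws IOException
>> {
>>   ByteArrayOutputStream overflow = new ByteArrayOutputStream();
>>   long offset = blockEndOffset;
>>   boolean newlineFound = false;
>>   while (!newlineFound) {
>>     GetObjectRequest request = new GetObjectRequest(bucketName, key)
>>         .withRange(offset, offset + OVERFLOW_FETCH_SIZE - 1);  // 1 MB slice
>>     S3Object slice;
>>     try {
>>       slice = s3Client.getObject(request);
>>     } catch (AmazonS3Exception e) {
>>       if (e.getStatusCode() == 416) {
>>         break;  // past end of file: the file does not end with a newline
>>       }
>>       throw e;
>>     }
>>     try (InputStream in = slice.getObjectContent()) {
>>       int b;
>>       while ((b = in.read()) != -1) {
>>         if (b == '\n') {
>>           newlineFound = true;
>>           break;
>>         }
>>         overflow.write(b);
>>       }
>>     }
>>     offset += OVERFLOW_FETCH_SIZE;
>>   }
>>   return overflow.toByteArray();
>> }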
>>
>>
>> Regards,
>> Ajay
>>
>>
>>
>> On Wed, Oct 19, 2016 at 7:31 AM, Ajay Gupta (JIRA) <j...@apache.org>
>> wrote:
>>
>>>
>>>      [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Ajay Gupta reassigned APEXMALHAR-2303:
>>> --------------------------------------
>>>
>>>     Assignee: Ajay Gupta
>>>
>>> > S3 Line By Line Module
>>> > ----------------------
>>> >
>>> >                 Key: APEXMALHAR-2303
>>> >                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
>>> >             Project: Apache Apex Malhar
>>> >          Issue Type: Bug
>>> >            Reporter: Ajay Gupta
>>> >            Assignee: Ajay Gupta
>>> >   Original Estimate: 336h
>>> >  Remaining Estimate: 336h
>>> >
>>> > This is a new module which will consist of 2 operators
>>> > 1) File Splitter -- Already existing in Malhar library
>>> > 2) S3RecordReader -- Read a file from S3 and output the records
>>> (delimited or fixed width)
>>>
>>>
>>>
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.3.4#6332)
>>>
>>
>>
>
