Re: S3 Output Module

Mohit Jotwani Thu, 27 Oct 2016 20:33:24 -0700

+1 for Solution 2

Regards,
Mohit
On 27 Oct 2016 2:02 p.m., "Sandeep Deshmukh" <[email protected]>
wrote:


> +1
>
> Regards,
> Sandeep
>
> On Thu, Oct 27, 2016 at 1:53 PM, Chaitanya Chebolu <
> [email protected]> wrote:
>
> > Hi All,
> >
> >   I am planning to implement approach (2) of the S3 Output Module, which I
> > proposed in my previous email. It should perform better than approach (1)
> > because the blocks are uploaded directly, without first saving them on HDFS.
> >
> >   Please share your opinions.
> >
> > Regards,
> > Chaitanya
> >
> > On Thu, Oct 20, 2016 at 8:11 PM, Chaitanya Chebolu <
> > [email protected]> wrote:
> >
> > > Hi All,
> > >
> > > > I am proposing the new design below for the S3 Output Module, using the
> > > > multipart upload feature:
> > >
> > > Input to this Module: FileMetadata, FileBlockMetadata, ReaderRecord
> > >
> > > Steps for uploading files using S3 multipart feature:
> > >
> > > =============================
> > >
> > >    1.
> > >
> > >    Initiate the upload. S3 will return upload id.
> > >
> > > Mandatory : bucket name, file path
> > >
> > > Note: Upload id is the unique identifier for multi part upload of a
> file.
> > >
> > >    1.
> > >
> > >    Upload each block using the received upload id. S3 will return ETag
> in
> > >    response of each upload.
> > >
> > > Mandatory: block number, upload id
> > >
> > >    1.
> > >
> > >    Send the merge request by providing the upload id and list of ETags
> .
> > >
> > > Mandatory: upload id, file path, block ETags.
> > >
> > > > Here is an example of uploading a file using the multipart feature:
> > > > <http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html>
> > >
> > >
> > > I am proposing the below two approaches for S3 output module.
> > >
> > >
> > > (Solution 1)
> > >
> > > S3 Output Module consists of the below two operators:
> > >
> > > > 1) BlockWriter: Writes the blocks into HDFS. Once a block is successfully
> > > > written into HDFS, this operator emits the BlockMetadata.
> > >
> > > 2) S3MultiPartUpload: This consists of two parts:
> > >
> > > >      a) If a file has more than one block, upload the blocks using the
> > > > multipart feature. Otherwise, upload the single block using
> > > > putObject().
> > > >
> > > >      b) Once all the blocks are successfully uploaded, send the
> > > > merge complete request.
> > >
> > >
> > > (Solution 2)
> > >
> > > > The DAG for this solution is as follows:
> > >
> > > 1) InitateS3Upload:
> > >
> > > Input: FileMetadata
> > >
> > > Initiates the upload. This operator emits (filemetadata, uploadId) to
> > > S3FileMerger and (filePath, uploadId) to S3BlockUpload.
> > >
> > > 2) S3BlockUpload:
> > >
> > > Input: FileBlockMetadata, ReaderRecord
> > >
> > > > Uploads the blocks into S3. S3 will return an ETag for each upload.
> > > > S3BlockUpload emits (path, ETag) to S3FileMerger.
> > >
> > > 3) S3FileMerger: Sends the file merge request to S3.
> > >
> > > Pros:
> > >
> > > > (1) Supports uploading files of up to 5 TB.
> > > >
> > > > (2) Reduces the end-to-end latency, because we do not wait until all the
> > > > blocks of a file are written to HDFS before uploading.
> > >
> > > Please vote and share your thoughts on these approaches.
> > >
> > > Regards,
> > > Chaitanya
> > >
> > > On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <
> > > [email protected]> wrote:
> > >
> > >> @ Tushar
> > >>
> > > >>   The S3 Copy Output Module consists of the following operators:
> > > >> 1) BlockWriter: Writes the blocks into HDFS.
> > > >> 2) Synchronizer: Sends a trigger to the downstream operator when all the
> > > >> blocks of a file have been written to HDFS.
> > > >> 3) FileMerger: Merges all the blocks into a file and uploads the
> > > >> merged file into the S3 bucket.
> > >>
> > >> @ Ashwin
> > >>
> > > >>     Good suggestion. In the first iteration, I will add the proposed
> > > >> design. I will add multipart support in the next iteration.
> > >>
> > >> Regards,
> > >> Chaitanya
> > >>
> > >> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <
> > >> [email protected]> wrote:
> > >>
> > >>> +1 regarding the s3 upload functionality.
> > >>>
> > > >>> However, I think we should just focus on multipart upload directly, as
> > > >>> it comes with various advantages like higher throughput, faster
> > > >>> recovery, and not needing to wait for the entire file to be created
> > > >>> before uploading each part.
> > > >>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
> > >>>
> > > >>> Also, it seems like we can do multipart upload if the file size is more
> > > >>> than 5MB. They do recommend using multipart if the file size is more
> > > >>> than 100MB. I am not sure if there is a hard lower limit though. See:
> > > >>> http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> > >>>
> > > >>> This way, it seems like we don't have to wait until a file is
> > > >>> completely written to HDFS before performing the upload operation.
> > >>>
> > >>> Regards,
> > >>> Ashwin.
> > >>>
> > > >>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <
> > > >>> [email protected]> wrote:
> > >>>
> > >>> > +1 , we need this functionality.
> > >>> >
> > > >>> > Is it going to be a single operator or multiple operators? If
> > > >>> > multiple operators, then can you explain what functionality each
> > > >>> > operator will provide?
> > >>> >
> > >>> >
> > >>> > Regards,
> > >>> > -Tushar.
> > >>> >
> > >>> >
> > > >>> > On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <
> > > >>> > [email protected]> wrote:
> > >>> >
> > >>> > > Writing to S3 is a common use-case for applications.
> > > >>> > > This module will definitely be helpful.
> > >>> > >
> > >>> > > +1 for adding this module.
> > >>> > >
> > >>> > >
> > >>> > > ~ Yogi
> > >>> > >
> > > >>> > > On 22 March 2016 at 13:52, Chaitanya Chebolu <
> > > >>> > > [email protected]> wrote:
> > >>> > >
> > >>> > > > Hi All,
> > >>> > > >
> > > >>> > > >   I am proposing an S3 output copy module. The primary
> > > >>> > > > functionality of this module is uploading files to an S3 bucket
> > > >>> > > > using a block-by-block approach.
> > >>> > > >
> > >>> > > >   Below is the JIRA created for this task:
> > >>> > > > https://issues.apache.org/jira/browse/APEXMALHAR-2022
> > >>> > > >
> > > >>> > > >   The design of this module is similar to the HDFS copy module, so
> > > >>> > > > I will extend the HDFS copy module for S3.
> > >>> > > >
> > >>> > > > Design of this Module:
> > >>> > > > =======================
> > > >>> > > > 1) Write the blocks into HDFS.
> > > >>> > > > 2) Merge the blocks into a file.
> > > >>> > > > 3) Upload the merged file into the S3 bucket using the
> > > >>> > > > AmazonS3Client APIs.
> > > >>> > > >
> > > >>> > > > Steps (1) & (2) are the same as in the HDFS copy module.
> > >>> > > >
> > > >>> > > > *Limitation:* Supports files of up to 5 GB only. Please refer to
> > > >>> > > > the link below for the limitations of uploading objects into S3:
> > > >>> > > > http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> > >>> > > >
> > > >>> > > > We can resolve the above limitation by using the S3 multipart
> > > >>> > > > feature. I will add multipart support in the next iteration.
> > >>> > > >
> > >>> > > >  Please share your thoughts on this.
> > >>> > > >
> > >>> > > > Regards,
> > >>> > > > Chaitanya
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>> Regards,
> > >>> Ashwin.
> > >>>
> > >>
> > >>
> > >
> >
>
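
For reference, the three multipart steps quoted above (initiate, upload each block, send the merge request) can be sketched roughly as below. This is only an illustration of the call order and of the mandatory arguments listed in the proposal. FakeS3 and its method names are hypothetical in-memory stand-ins, not the AWS SDK, and the ETag here is just an opaque checksum chosen for the sketch.

```python
import hashlib
import uuid


class FakeS3:
    """Hypothetical in-memory stand-in for S3; not the AWS SDK."""

    def __init__(self):
        self.uploads = {}  # upload id -> {"key": (bucket, path), "parts": {number: bytes}}
        self.objects = {}  # (bucket, path) -> merged bytes

    def initiate_upload(self, bucket, path):
        # Step 1: bucket name and file path are mandatory; an upload id comes back.
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {"key": (bucket, path), "parts": {}}
        return upload_id

    def upload_block(self, upload_id, block_number, data):
        # Step 2: block number and upload id are mandatory; an ETag comes back.
        self.uploads[upload_id]["parts"][block_number] = data
        return hashlib.sha256(data).hexdigest()  # opaque token for the sketch

    def merge(self, upload_id, path, etags):
        # Step 3: upload id, file path and the block ETags are mandatory.
        upload = self.uploads[upload_id]
        if upload["key"][1] != path:
            raise ValueError("file path does not match the upload id")
        for number, etag in etags:
            if hashlib.sha256(upload["parts"][number]).hexdigest() != etag:
                raise ValueError("ETag mismatch for block %d" % number)
        # Blocks are concatenated in ascending block-number order.
        merged = b"".join(upload["parts"][number] for number, _ in sorted(etags))
        self.objects[upload["key"]] = merged
        del self.uploads[upload_id]


def upload_file(s3, bucket, path, blocks):
    """Run the three steps for one file given its blocks in order."""
    upload_id = s3.initiate_upload(bucket, path)
    etags = [
        (number, s3.upload_block(upload_id, number, data))
        for number, data in enumerate(blocks, start=1)
    ]
    s3.merge(upload_id, path, etags)


s3 = FakeS3()
upload_file(s3, "my-bucket", "dir/file.txt", [b"block-1 ", b"block-2 ", b"block-3"])
```

Block numbers start at 1 here because S3 part numbers do; the bucket and path values are made up for the example.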

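One detail of Solution 2 quoted above is that S3FileMerger receives the upload id and the block ETags on separate streams, so it has to hold them until everything for a file has arrived. A minimal sketch of that bookkeeping follows. It rests on two assumptions that are not stated in the proposal: that the file metadata carries the number of blocks, and that each (path, ETag) tuple also carries its block number. The class and method names are hypothetical.

```python
class FileMergerSketch:
    """Hypothetical bookkeeping for an S3FileMerger-like operator."""

    def __init__(self, send_merge_request):
        # Callback that would issue the merge request to S3.
        self.send_merge_request = send_merge_request
        self.expected = {}  # path -> (upload id, block count)
        self.etags = {}     # path -> {block number: ETag}

    def on_file_metadata(self, path, upload_id, block_count):
        # Tuple from the initiate operator; assumes block_count >= 1.
        self.expected[path] = (upload_id, block_count)
        self._maybe_merge(path)

    def on_block_uploaded(self, path, block_number, etag):
        # Tuple from the block upload operator; may arrive before the metadata.
        self.etags.setdefault(path, {})[block_number] = etag
        self._maybe_merge(path)

    def _maybe_merge(self, path):
        if path not in self.expected:
            return  # upload id not known yet
        upload_id, block_count = self.expected[path]
        received = self.etags.get(path, {})
        if len(received) < block_count:
            return  # still waiting for some blocks
        # Send the ETags in ascending block-number order, then forget the file.
        self.send_merge_request(upload_id, path, sorted(received.items()))
        del self.expected[path]
        self.etags.pop(path, None)


sent = []
merger = FileMergerSketch(lambda upload_id, path, etags: sent.append((upload_id, path, etags)))
merger.on_block_uploaded("dir/file.txt", 2, "etag-2")   # arrives first, out of order
merger.on_file_metadata("dir/file.txt", "upload-1", 2)
merger.on_block_uploaded("dir/file.txt", 1, "etag-1")   # last block triggers the merge
```

The sketch ignores checkpointing and recovery, which a real operator would need in order to keep this state across failures.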