+1

Regards,
Sandeep

On Thu, Oct 27, 2016 at 1:53 PM, Chaitanya Chebolu <[email protected]> wrote:

> Hi All,
>
> I am planning to implement the approach (2) of S3 Output Module which I
> proposed in my previous email. Performance would be better as compared to
> approach (1) because of uploading the blocks without saving it on HDFS.
>
> Please share your opinions.
>
> Regards,
> Chaitanya
>
> On Thu, Oct 20, 2016 at 8:11 PM, Chaitanya Chebolu <[email protected]> wrote:
>
> > Hi All,
> >
> > I am proposing the below new design for S3 Output Module using multi part
> > upload feature:
> >
> > Input to this Module: FileMetadata, FileBlockMetadata, ReaderRecord
> >
> > Steps for uploading files using S3 multipart feature:
> >
> > =============================
> >
> > 1. Initiate the upload. S3 will return upload id.
> >
> >    Mandatory : bucket name, file path
> >
> >    Note: Upload id is the unique identifier for multi part upload of a file.
> >
> > 2. Upload each block using the received upload id. S3 will return ETag in
> >    response of each upload.
> >
> >    Mandatory: block number, upload id
> >
> > 3. Send the merge request by providing the upload id and list of ETags.
> >
> >    Mandatory: upload id, file path, block ETags.
> >
> > Here
> > <http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html> is
> > an example link for uploading a file using multi part feature:
> >
> > I am proposing the below two approaches for S3 output module.
> >
> > (Solution 1)
> >
> > S3 Output Module consists of the below two operators:
> >
> > 1) BlockWriter : Write the blocks into the HDFS. Once successfully written
> > into HDFS, then this will emit the BlockMetadata.
> >
> > 2) S3MultiPartUpload: This consists of two parts:
> >
> >      a) If the number of blocks of a file is > 1 then upload the blocks
> >      using multi part feature. Otherwise, will upload the block using
> >      putObject().
> >
> >      b) Once all the blocks are successfully uploaded then will send the
> >      merge complete request.
> >
> > (Solution 2)
> >
> > DAG for this solution as follows:
> >
> > 1) InitateS3Upload:
> >
> > Input: FileMetadata
> >
> > Initiates the upload. This operator emits (filemetadata, uploadId) to
> > S3FileMerger and (filePath, uploadId) to S3BlockUpload.
> >
> > 2) S3BlockUpload:
> >
> > Input: FileBlockMetadata, ReaderRecord
> >
> > Upload the blocks into S3. S3 will return ETag for each upload.
> > S3BlockUpload emits (path, ETag) to S3FileMerger.
> >
> > 3) S3FileMerger: Sends the file merge request to S3.
> >
> > Pros:
> >
> > (1) Supports the size of file to upload is up to 5 TB.
> >
> > (2) Reduces the end to end latency. Because, we are not waiting to upload
> > until all the blocks of a file written to HDFS.
> >
> > Please vote and share your thoughts on these approaches.
> >
> > Regards,
> > Chaitanya
> >
> > On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <[email protected]> wrote:
> >
> >> @ Tushar
> >>
> >> S3 Copy Output Module consists of following operators:
> >> 1) BlockWriter : Writes the blocks into the HDFS.
> >> 2) Synchronizer: Sends trigger to downstream operator, when all the
> >> blocks for a file written to HDFS.
> >> 3) FileMerger: Merges all the blocks into a file and will upload the
> >> merged file into S3 bucket.
> >>
> >> @ Ashwin
> >>
> >> Good suggestion. In the first iteration, I will add the proposed design.
> >> Multipart support will add it in the next iteration.
> >>
> >> Regards,
> >> Chaitanya
> >>
> >> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <[email protected]> wrote:
> >>
> >>> +1 regarding the s3 upload functionality.
> >>>
> >>> However, I think we should just focus on multipart upload directly as it
> >>> comes with various advantages like higher throughput, faster recovery, not
> >>> needing to wait for entire file being created before uploading each part.
> >>> See:
> >>> http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
> >>>
> >>> Also, seems like we can do multipart upload if the file size is more than
> >>> 5MB. They do recommend using multipart if the file size is more than 100MB.
> >>> I am not sure if there is a hard lower limit though. See:
> >>> http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> >>>
> >>> This way, it seems like we don't have to wait until a file is completely
> >>> written to hdfs before performing the upload operation.
> >>>
> >>> Regards,
> >>> Ashwin.
> >>>
> >>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <[email protected]> wrote:
> >>>
> >>> > +1 , we need this functionality.
> >>> >
> >>> > Is it going to be a single operator or multiple operators? If multiple
> >>> > operators, then can you explain what functionality each operator will
> >>> > provide?
> >>> >
> >>> > Regards,
> >>> > -Tushar.
> >>> >
> >>> > On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <[email protected]> wrote:
> >>> >
> >>> > > Writing to S3 is a common use-case for applications.
> >>> > > This module will be definitely helpful.
> >>> > >
> >>> > > +1 for adding this module.
> >>> > >
> >>> > > ~ Yogi
> >>> > >
> >>> > > On 22 March 2016 at 13:52, Chaitanya Chebolu <[email protected]> wrote:
> >>> > >
> >>> > > > Hi All,
> >>> > > >
> >>> > > > I am proposing S3 output copy Module. Primary functionality of this
> >>> > > > module is uploading files to S3 bucket using block-by-block approach.
> >>> > > >
> >>> > > > Below is the JIRA created for this task:
> >>> > > > https://issues.apache.org/jira/browse/APEXMALHAR-2022
> >>> > > >
> >>> > > > Design of this module is similar to HDFS copy module. So, I will extend
> >>> > > > HDFS copy module for S3.
> >>> > > >
> >>> > > > Design of this Module:
> >>> > > > =======================
> >>> > > > 1) Writing blocks into HDFS.
> >>> > > > 2) Merge the blocks into a file .
> >>> > > > 3) Upload the above merged file into S3 Bucket using AmazonS3Client
> >>> > > > API's.
> >>> > > >
> >>> > > > Steps (1) & (2) are same as HDFS copy module.
> >>> > > >
> >>> > > > *Limitation:* Supports the size of file is up to 5 GB. Please refer the
> >>> > > > below link about limitations of Uploading objects into S3:
> >>> > > > http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> >>> > > >
> >>> > > > We can resolve the above limitation by using S3 Multipart feature. I will
> >>> > > > add multipart support in next iteration.
> >>> > > >
> >>> > > > Please share your thoughts on this.
> >>> > > >
> >>> > > > Regards,
> >>> > > > Chaitanya
> >>>
> >>> --
> >>>
> >>> Regards,
> >>> Ashwin.
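
The three multipart steps described in the Oct 20 message (initiate the upload, upload each block, send the merge request with the ETags) can be sketched roughly as follows. This is only an illustration in Python against an in-memory stand-in: FakeS3, its methods, and upload_file are hypothetical names, not the AWS SDK and not the proposed Malhar operators, and the ETag here is simply a SHA-256 digest chosen for the sketch.

```python
import hashlib
import uuid


class FakeS3:
    """In-memory stand-in for S3 (hypothetical; not the AWS SDK)."""

    def __init__(self):
        self.uploads = {}  # upload id -> {"key": (bucket, path), "parts": {number: bytes}}
        self.objects = {}  # (bucket, path) -> merged bytes

    def initiate_upload(self, bucket, path):
        # Step 1: needs bucket name and file path, returns an upload id.
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {"key": (bucket, path), "parts": {}}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # Step 2: needs block number and upload id, returns an ETag.
        # The digest is an assumption of this sketch, not S3's real ETag.
        self.uploads[upload_id]["parts"][part_number] = data
        return hashlib.sha256(data).hexdigest()

    def complete_upload(self, upload_id, etags):
        # Step 3: needs the upload id and the ETag of every block.
        upload = self.uploads.pop(upload_id)
        parts = upload["parts"]
        for number, etag in etags.items():
            if hashlib.sha256(parts[number]).hexdigest() != etag:
                raise ValueError(f"ETag mismatch for part {number}")
        # Blocks are merged in block-number order, not arrival order.
        self.objects[upload["key"]] = b"".join(parts[n] for n in sorted(etags))


def upload_file(s3, bucket, path, blocks):
    """Run the three steps for one file; blocks is a list of bytes."""
    upload_id = s3.initiate_upload(bucket, path)
    etags = {}
    for number, block in enumerate(blocks, start=1):
        etags[number] = s3.upload_part(upload_id, number, block)
    s3.complete_upload(upload_id, etags)
    return etags
```

In Solution 2 these three calls would be split across the three operators, with the upload id and the (block number, ETag) pairs carried between them as tuples rather than held in one function.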
