+1 for Solution 2

Regards,
Mohit

On 27 Oct 2016 2:02 p.m., "Sandeep Deshmukh" <[email protected]> wrote:
> +1
>
> Regards,
> Sandeep
>
> On Thu, Oct 27, 2016 at 1:53 PM, Chaitanya Chebolu <[email protected]> wrote:
>
> > Hi All,
> >
> > I am planning to implement approach (2) of the S3 Output Module, which I
> > proposed in my previous email. Performance should be better than with
> > approach (1) because the blocks are uploaded without first being saved
> > to HDFS.
> >
> > Please share your opinions.
> >
> > Regards,
> > Chaitanya
> >
> > On Thu, Oct 20, 2016 at 8:11 PM, Chaitanya Chebolu <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > I am proposing the below new design for the S3 Output Module using the
> > > multipart upload feature:
> > >
> > > Input to this module: FileMetadata, FileBlockMetadata, ReaderRecord
> > >
> > > Steps for uploading files using the S3 multipart feature:
> > >
> > > =============================
> > >
> > > 1. Initiate the upload. S3 will return an upload id.
> > >
> > >    Mandatory: bucket name, file path
> > >
> > >    Note: The upload id is the unique identifier for the multipart
> > >    upload of a file.
> > >
> > > 2. Upload each block using the received upload id. S3 will return an
> > >    ETag in the response to each upload.
> > >
> > >    Mandatory: block number, upload id
> > >
> > > 3. Send the merge request, providing the upload id and the list of ETags.
> > >
> > >    Mandatory: upload id, file path, block ETags.
> > >
> > > Here
> > > <http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html>
> > > is an example of uploading a file using the multipart feature.
> > >
> > > I am proposing the below two approaches for the S3 output module.
> > >
> > > (Solution 1)
> > >
> > > The S3 Output Module consists of the below two operators:
> > >
> > > 1) BlockWriter: Writes the blocks into HDFS. Once a block is
> > > successfully written into HDFS, this operator emits the BlockMetadata.
> > >
> > > 2) S3MultiPartUpload: This consists of two parts:
> > >
> > >    a) If the number of blocks of a file is greater than 1, upload the
> > >       blocks using the multipart feature. Otherwise, upload the single
> > >       block using putObject().
> > >
> > >    b) Once all the blocks are successfully uploaded, send the merge
> > >       complete request.
> > >
> > > (Solution 2)
> > >
> > > The DAG for this solution is as follows:
> > >
> > > 1) InitateS3Upload:
> > >
> > >    Input: FileMetadata
> > >
> > >    Initiates the upload. This operator emits (filemetadata, uploadId) to
> > >    S3FileMerger and (filePath, uploadId) to S3BlockUpload.
> > >
> > > 2) S3BlockUpload:
> > >
> > >    Input: FileBlockMetadata, ReaderRecord
> > >
> > >    Uploads the blocks into S3. S3 will return an ETag for each upload.
> > >    S3BlockUpload emits (path, ETag) to S3FileMerger.
> > >
> > > 3) S3FileMerger: Sends the file merge request to S3.
> > >
> > > Pros:
> > >
> > > (1) Supports uploading files of up to 5 TB.
> > >
> > > (2) Reduces the end-to-end latency, because we do not wait until all
> > > the blocks of a file are written to HDFS before uploading.
> > >
> > > Please vote and share your thoughts on these approaches.
> > >
> > > Regards,
> > > Chaitanya
> > >
> > > On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <[email protected]> wrote:
> > >
> > >> @ Tushar
> > >>
> > >> The S3 Copy Output Module consists of the following operators:
> > >> 1) BlockWriter: Writes the blocks into HDFS.
> > >> 2) Synchronizer: Sends a trigger to the downstream operator when all
> > >> the blocks for a file have been written to HDFS.
> > >> 3) FileMerger: Merges all the blocks into a file and uploads the
> > >> merged file into the S3 bucket.
> > >>
> > >> @ Ashwin
> > >>
> > >> Good suggestion. In the first iteration, I will add the proposed
> > >> design. I will add multipart support in the next iteration.
> > >>
> > >> Regards,
> > >> Chaitanya
> > >>
> > >> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <[email protected]> wrote:
> > >>
> > >>> +1 regarding the s3 upload functionality.
> > >>>
> > >>> However, I think we should just focus on multipart upload directly, as
> > >>> it comes with various advantages like higher throughput, faster
> > >>> recovery, and not needing to wait for the entire file to be created
> > >>> before uploading each part.
> > >>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
> > >>>
> > >>> Also, it seems we can do multipart upload if the file size is more
> > >>> than 5MB. They do recommend using multipart if the file size is more
> > >>> than 100MB.
> > >>> I am not sure if there is a hard lower limit though. See:
> > >>> http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> > >>>
> > >>> This way, it seems we don't have to wait until a file is completely
> > >>> written to hdfs before performing the upload operation.
> > >>>
> > >>> Regards,
> > >>> Ashwin.
> > >>>
> > >>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <[email protected]> wrote:
> > >>>
> > >>> > +1, we need this functionality.
> > >>> >
> > >>> > Is it going to be a single operator or multiple operators? If
> > >>> > multiple operators, then can you explain what functionality each
> > >>> > operator will provide?
> > >>> >
> > >>> > Regards,
> > >>> > -Tushar.
> > >>> >
> > >>> > On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <[email protected]> wrote:
> > >>> >
> > >>> > > Writing to S3 is a common use-case for applications.
> > >>> > > This module will definitely be helpful.
> > >>> > >
> > >>> > > +1 for adding this module.
> > >>> > >
> > >>> > > ~ Yogi
> > >>> > >
> > >>> > > On 22 March 2016 at 13:52, Chaitanya Chebolu <[email protected]> wrote:
> > >>> > >
> > >>> > > > Hi All,
> > >>> > > >
> > >>> > > > I am proposing an S3 output copy module. The primary
> > >>> > > > functionality of this module is uploading files to an S3 bucket
> > >>> > > > using a block-by-block approach.
> > >>> > > >
> > >>> > > > Below is the JIRA created for this task:
> > >>> > > > https://issues.apache.org/jira/browse/APEXMALHAR-2022
> > >>> > > >
> > >>> > > > The design of this module is similar to the HDFS copy module, so
> > >>> > > > I will extend the HDFS copy module for S3.
> > >>> > > >
> > >>> > > > Design of this Module:
> > >>> > > > =======================
> > >>> > > > 1) Write the blocks into HDFS.
> > >>> > > > 2) Merge the blocks into a file.
> > >>> > > > 3) Upload the merged file into the S3 bucket using the
> > >>> > > > AmazonS3Client APIs.
> > >>> > > >
> > >>> > > > Steps (1) & (2) are the same as in the HDFS copy module.
> > >>> > > >
> > >>> > > > *Limitation:* Supports files of up to 5 GB. Please refer to the
> > >>> > > > link below for the limitations on uploading objects into S3:
> > >>> > > > http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> > >>> > > >
> > >>> > > > We can resolve the above limitation by using the S3 multipart
> > >>> > > > feature. I will add multipart support in the next iteration.
> > >>> > > >
> > >>> > > > Please share your thoughts on this.
> > >>> > > >
> > >>> > > > Regards,
> > >>> > > > Chaitanya
> > >>> > > >
> > >>>
> > >>> --
> > >>>
> > >>> Regards,
> > >>> Ashwin.
> > >>>
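
For reference, the three-step multipart flow discussed in this thread (initiate the upload, upload each block against the upload id, then send the merge request with the collected ETags) can be sketched roughly as below. This is only an illustration under stated assumptions: `FakeMultipartStore` and `upload_file` are hypothetical names, the store is an in-memory stand-in rather than S3 or the AWS SDK, and the SHA-256 digest used as an "ETag" is a placeholder for whatever opaque value S3 returns.

```python
import hashlib
import uuid


class FakeMultipartStore:
    """Hypothetical in-memory stand-in for S3 multipart upload (not the AWS SDK)."""

    def __init__(self):
        self._uploads = {}  # upload id -> {"key": str, "parts": {block number: bytes}}
        self.objects = {}   # "bucket/path" -> merged bytes

    def initiate(self, bucket, path):
        # Step 1: bucket name and file path are mandatory; an upload id comes back.
        upload_id = uuid.uuid4().hex
        self._uploads[upload_id] = {"key": f"{bucket}/{path}", "parts": {}}
        return upload_id

    def upload_part(self, upload_id, block_number, data):
        # Step 2: each block is uploaded against the upload id; an ETag comes back.
        self._uploads[upload_id]["parts"][block_number] = data
        return hashlib.sha256(data).hexdigest()  # placeholder ETag (assumption)

    def complete(self, upload_id, block_etags):
        # Step 3: the merge request carries the upload id and the list of ETags.
        upload = self._uploads[upload_id]
        merged = bytearray()
        for block_number, etag in sorted(block_etags):
            data = upload["parts"][block_number]
            if hashlib.sha256(data).hexdigest() != etag:
                raise ValueError(f"ETag mismatch for block {block_number}")
            merged += data
        self.objects[upload["key"]] = bytes(merged)
        del self._uploads[upload_id]
        return upload["key"]


def upload_file(store, bucket, path, blocks):
    """Upload (block_number, bytes) pairs, which may arrive in any order."""
    upload_id = store.initiate(bucket, path)  # role of InitateS3Upload
    etags = [(number, store.upload_part(upload_id, number, data))  # role of S3BlockUpload
             for number, data in blocks]
    return store.complete(upload_id, etags)  # role of S3FileMerger


store = FakeMultipartStore()
# Blocks are handed over out of order; the merge step orders them by block number.
key = upload_file(store, "my-bucket", "dir/file.bin", [(2, b"world"), (1, b"hello ")])
```

The point of the sketch is that blocks can be uploaded as they arrive, without first being written to HDFS and merged, because the final merge request only needs the upload id and the per-block ETags.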
