[jira] [Resolved] (APEXMALHAR-2369) S3 output module for tuple based output
[ https://issues.apache.org/jira/browse/APEXMALHAR-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chaitanya resolved APEXMALHAR-2369. --- Resolution: Fixed Fix Version/s: 3.7.0 > S3 output module for tuple based output > --- > > Key: APEXMALHAR-2369 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2369 > Project: Apache Apex Malhar > Issue Type: Task > Reporter: Yogi Devendra > Assignee: Yogi Devendra > Fix For: 3.7.0 > > > Currently, S3 output is available via S3OutputModule, which is restricted to copying files from a FileSystem to S3. Use cases where all the tuples/records are to be written to S3 cannot use this approach, so we need to develop an alternative module that takes care of writing tuples to S3. > Design: > Sending a separate request to S3 for each tuple would be too expensive. This module can instead write tuples to HDFS and then upload the HDFS files to S3. This adds some end-to-end latency, but that should be acceptable for the S3 output case. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
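The design in this issue (buffer tuples into HDFS files, then upload the finished files to S3) can be sketched as below. This is an illustrative sketch only, not the Malhar operator: it uses a local temp directory in place of HDFS and a stand-in `Uploader` callback in place of the S3 client, and all names (`TupleBuffer`, `Uploader`, the 16-byte roll threshold) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Stand-in for the S3 upload step; the real module would hand the file to S3.
interface Uploader {
    void upload(Path file) throws IOException;
}

// Buffers incoming tuples into files; rolls and uploads when a file exceeds
// a size threshold, mirroring "write tuples to HDFS, then upload to S3".
class TupleBuffer {
    private final Path dir;
    private final long rollSizeBytes;
    private final Uploader uploader;
    private Path current;
    private long written;
    private int fileIndex;

    TupleBuffer(Path dir, long rollSizeBytes, Uploader uploader) {
        this.dir = dir;
        this.rollSizeBytes = rollSizeBytes;
        this.uploader = uploader;
    }

    // Append one tuple as a line; roll and upload when the threshold is hit.
    void write(String tuple) throws IOException {
        if (current == null) {
            current = dir.resolve("part-" + fileIndex++);
            written = 0;
        }
        byte[] line = (tuple + "\n").getBytes(StandardCharsets.UTF_8);
        Files.write(current, line, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        written += line.length;
        if (written >= rollSizeBytes) {
            flush();
        }
    }

    // Close out the current file and hand it to the uploader.
    void flush() throws IOException {
        if (current != null) {
            uploader.upload(current);
            current = null;
        }
    }
}

public class TupleOutputSketch {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tuples");
        List<Path> uploaded = new ArrayList<>();
        TupleBuffer buffer = new TupleBuffer(dir, 16, uploaded::add);
        for (String t : new String[]{"record-1", "record-2", "record-3"}) {
            buffer.write(t);
        }
        buffer.flush();
        System.out.println("uploaded files: " + uploaded.size());
    }
}
```

Batching many tuples per uploaded file is what avoids the per-tuple S3 request cost noted in the issue, at the price of the latency it accepts.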
[jira] [Commented] (APEXMALHAR-2369) S3 output module for tuple based output
[ https://issues.apache.org/jira/browse/APEXMALHAR-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15884302#comment-15884302 ] ASF GitHub Bot commented on APEXMALHAR-2369: Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/542
[jira] [Commented] (APEXMALHAR-2369) S3 output module for tuple based output
[ https://issues.apache.org/jira/browse/APEXMALHAR-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837265#comment-15837265 ] Yogi Devendra commented on APEXMALHAR-2369: --- [~chaithu] Could you please review this?
[jira] [Updated] (APEXMALHAR-2369) S3 output module for tuple based output
[ https://issues.apache.org/jira/browse/APEXMALHAR-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yogi Devendra updated APEXMALHAR-2369: -- Description: reformatted (text unchanged)
[jira] [Created] (APEXMALHAR-2369) S3 output module for tuple based output
Yogi Devendra created APEXMALHAR-2369: - Summary: S3 output module for tuple based output Key: APEXMALHAR-2369 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2369 Project: Apache Apex Malhar Issue Type: Task Reporter: Yogi Devendra Assignee: Yogi Devendra
[jira] [Resolved] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bhupesh Chawda resolved APEXMALHAR-2022. Resolution: Fixed Fix Version/s: 3.7.0 > S3 Output Module for file copy > -- > > Key: APEXMALHAR-2022 > URL: https://issues.apache.org/jira/browse/APEXMALHAR-2022 > Project: Apache Apex Malhar > Issue Type: Task > Reporter: Chaitanya > Assignee: Chaitanya > Fix For: 3.7.0 > > > The primary functionality of this module is to copy files into an S3 bucket using a block-by-block approach.
[jira] [Commented] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15710953#comment-15710953 ] ASF GitHub Bot commented on APEXMALHAR-2022: Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/483
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
Github user asfgit closed the pull request at: https://github.com/apache/apex-malhar/pull/483 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15708479#comment-15708479 ] ASF GitHub Bot commented on APEXMALHAR-2022: Github user chaithu14 closed the pull request at: https://github.com/apache/apex-malhar/pull/483
[jira] [Commented] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15708481#comment-15708481 ] ASF GitHub Bot commented on APEXMALHAR-2022: GitHub user chaithu14 reopened a pull request: https://github.com/apache/apex-malhar/pull/483 APEXMALHAR-2022 Developed S3 Output Module You can merge this pull request into a Git repository by running: $ git pull https://github.com/chaithu14/incubator-apex-malhar APEXMALHAR-2022-S3Output-multiPart Alternatively you can review and apply these changes as the patch at: https://github.com/apache/apex-malhar/pull/483.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #483 commit a5e8fa3facca750f5d7402c2c29e7cbabe53bd9e Author: chaitanya <chai...@apache.org> Date: 2016-11-30T05:17:36Z APEXMALHAR-2022 Development of S3 Output Module
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
GitHub user chaithu14 reopened a pull request: https://github.com/apache/apex-malhar/pull/483 APEXMALHAR-2022 Developed S3 Output Module
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
Github user chaithu14 closed the pull request at: https://github.com/apache/apex-malhar/pull/483
[jira] [Commented] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15708392#comment-15708392 ] ASF GitHub Bot commented on APEXMALHAR-2022: Github user chaithu14 closed the pull request at: https://github.com/apache/apex-malhar/pull/483
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
GitHub user chaithu14 reopened a pull request: https://github.com/apache/apex-malhar/pull/483 APEXMALHAR-2022 Developed S3 Output Module
[jira] [Commented] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15708393#comment-15708393 ] ASF GitHub Bot commented on APEXMALHAR-2022: GitHub user chaithu14 reopened the pull request at: https://github.com/apache/apex-malhar/pull/483
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
Github user chaithu14 closed the pull request at: https://github.com/apache/apex-malhar/pull/483
[jira] [Commented] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15701535#comment-15701535 ] ASF GitHub Bot commented on APEXMALHAR-2022: Github user chaithu14 closed the pull request at: https://github.com/apache/apex-malhar/pull/483
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
GitHub user chaithu14 reopened a pull request: https://github.com/apache/apex-malhar/pull/483 APEXMALHAR-2022 Developed S3 Output Module commit 6ab63bd92dc93ac4ddb3d6ce70d310cfa9322f82 Author: chaitanya <chai...@apache.org> Date: 2016-11-28T08:54:54Z APEXMALHAR-2022 Development of S3 Output Module
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
Github user chaithu14 closed the pull request at: https://github.com/apache/apex-malhar/pull/483
[jira] [Updated] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Kapoor updated APEXMALHAR-2022: -- Assignee: Chaitanya (was: Hitesh Kapoor)
[jira] [Commented] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15636249#comment-15636249 ] ASF GitHub Bot commented on APEXMALHAR-2022: GitHub user chaithu14 opened a pull request: https://github.com/apache/apex-malhar/pull/483 APEXMALHAR-2022 Developed S3 Output Module commit 24fb5638ecb6f0e45edb5d5f640b220ad9372fcc Author: chaitanya <chai...@apache.org> Date: 2016-11-04T12:48:21Z APEXMALHAR-2022 Developed S3 Output Module
[GitHub] apex-malhar pull request #483: APEXMALHAR-2022 Developed S3 Output Module
GitHub user chaithu14 opened a pull request: https://github.com/apache/apex-malhar/pull/483 APEXMALHAR-2022 Developed S3 Output Module
Re: S3 Output Module
+1 for Solution 2 Regards, Mohit On 27 Oct 2016 2:02 p.m., "Sandeep Deshmukh" <sand...@datatorrent.com> wrote: > +1 > > Regards, > Sandeep
Re: S3 Output Module
+1 Regards, Sandeep On Thu, Oct 27, 2016 at 1:53 PM, Chaitanya Chebolu < chaita...@datatorrent.com> wrote: > Hi All, > > I am planning to implement the approach (2) of S3 Output Module which I > proposed in my previous email. Performance would be better as compared to > approach (1) because of uploading the blocks without saving it on HDFS. > > Please share your opinions. > > Regards, > Chaitanya
Re: S3 Output Module
Hi All,

I am planning to implement approach (2) of the S3 Output Module, which I proposed in my previous email. Performance would be better than approach (1) because the blocks are uploaded without first being saved on HDFS. Please share your opinions.

Regards,
Chaitanya

On Thu, Oct 20, 2016 at 8:11 PM, Chaitanya Chebolu <chaita...@datatorrent.com> wrote:
> Hi All,
>
> I am proposing the new design below for the S3 Output Module using the multipart upload feature:
>
> Input to this module: FileMetadata, FileBlockMetadata, ReaderRecord
>
> Steps for uploading files using the S3 multipart feature:
>
> 1. Initiate the upload. S3 returns an upload id.
>    Mandatory: bucket name, file path
>    Note: The upload id is the unique identifier for the multipart upload of a file.
>
> 2. Upload each block using the received upload id. S3 returns an ETag in response to each upload.
>    Mandatory: block number, upload id
>
> 3. Send the merge request, providing the upload id and the list of ETags.
>    Mandatory: upload id, file path, block ETags
>
> Here <http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html> is an example of uploading a file using the multipart feature.
>
> I am proposing the two approaches below for the S3 output module.
>
> (Solution 1)
>
> The S3 Output Module consists of the following two operators:
>
> 1) BlockWriter: Writes the blocks into HDFS. Once a block is successfully written to HDFS, this operator emits the BlockMetadata.
>
> 2) S3MultiPartUpload: This consists of two parts:
>    a) If the number of blocks of a file is > 1, upload the blocks using the multipart feature. Otherwise, upload the single block using putObject().
>    b) Once all the blocks are successfully uploaded, send the merge-complete request.
>
> (Solution 2)
>
> The DAG for this solution is as follows:
>
> 1) InitiateS3Upload:
>    Input: FileMetadata
>    Initiates the upload. This operator emits (fileMetadata, uploadId) to S3FileMerger and (filePath, uploadId) to S3BlockUpload.
>
> 2) S3BlockUpload:
>    Input: FileBlockMetadata, ReaderRecord
>    Uploads the blocks into S3. S3 returns an ETag for each upload. S3BlockUpload emits (path, ETag) to S3FileMerger.
>
> 3) S3FileMerger: Sends the file merge request to S3.
>
> Pros:
> (1) Supports files up to 5 TB in size.
> (2) Reduces end-to-end latency, because we do not wait until all the blocks of a file are written to HDFS before uploading.
>
> Please vote and share your thoughts on these approaches.
>
> Regards,
> Chaitanya
>
> On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <chaita...@datatorrent.com> wrote:
>> @Tushar
>>
>> The S3 Copy Output Module consists of the following operators:
>> 1) BlockWriter: Writes the blocks into HDFS.
>> 2) Synchronizer: Sends a trigger to the downstream operator when all the blocks of a file are written to HDFS.
>> 3) FileMerger: Merges all the blocks into a file and uploads the merged file into the S3 bucket.
>>
>> @Ashwin
>>
>> Good suggestion. In the first iteration, I will add the proposed design. Multipart support will be added in the next iteration.
>>
>> Regards,
>> Chaitanya
>>
>> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <ashwinchand...@gmail.com> wrote:
>>> +1 regarding the S3 upload functionality.
>>>
>>> However, I think we should just focus on multipart upload directly, as it comes with various advantages: higher throughput, faster recovery, and not needing to wait for the entire file to be created before uploading each part.
>>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
>>>
>>> Also, it seems we can do a multipart upload if the file size is more than 5 MB. They recommend using multipart if the file size is more than 100 MB. I am not sure if there is a hard lower limit, though.
>>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
>>>
>>> This way, it seems we don't have to wait until a file is completely written to HDFS before performing the upload operation.
>>>
>>> Regards,
>>> Ashwin.
>>>
>>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <tus...@datatorrent.com> wrote:
>>>> +1, we need this functionality.
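The bookkeeping behind the two proposed solutions (one S3 part per HDFS block, multipart only when a file spans more than one block, and ETags collected and ordered for the final merge request) can be sketched in plain Java. This is a minimal illustration under stated assumptions: `S3UploadPlan`, `mergeOrder`, and the `PartETag` holder are hypothetical names for this sketch, not the actual Malhar operator code or the AWS SDK types.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper sketching the bookkeeping described in the proposals;
// the real module would wire this logic into Apex operators and the AWS SDK.
public class S3UploadPlan {

    // Number of S3 parts for a file: one part per HDFS block.
    static int partCount(long fileLength, long blockSize) {
        return (int) ((fileLength + blockSize - 1) / blockSize);
    }

    // Solution 1 rule: multipart only when the file spans more than one
    // block; a single-block file goes through a plain putObject() instead.
    static boolean useMultipart(long fileLength, long blockSize) {
        return partCount(fileLength, blockSize) > 1;
    }

    // Simple (partNumber, eTag) pair; S3 part numbers are 1-based.
    static class PartETag {
        final int partNumber;
        final String eTag;
        PartETag(int partNumber, String eTag) {
            this.partNumber = partNumber;
            this.eTag = eTag;
        }
    }

    // S3FileMerger in Solution 2 may receive ETags out of order (blocks
    // upload in parallel); the final merge request lists them sorted by
    // part number.
    static List<PartETag> mergeOrder(List<PartETag> received) {
        List<PartETag> sorted = new ArrayList<>(received);
        sorted.sort(Comparator.comparingInt(p -> p.partNumber));
        return sorted;
    }

    public static void main(String[] args) {
        long blockSize = 5L * 1024 * 1024;                   // 5 MB HDFS blocks
        long fileLength = 12L * 1024 * 1024;                 // 12 MB file
        System.out.println(partCount(fileLength, blockSize)); // 3 parts
        System.out.println(useMultipart(fileLength, blockSize));
        System.out.println(useMultipart(1024, blockSize));    // one block: putObject()

        List<PartETag> received = new ArrayList<>();
        received.add(new PartETag(2, "etag-b"));              // arrives first
        received.add(new PartETag(1, "etag-a"));
        for (PartETag p : mergeOrder(received)) {
            System.out.println(p.partNumber + ":" + p.eTag);
        }
    }
}
```

In the real flow, the part-number-ordered ETag list is what the complete-multipart-upload request to S3 expects; note also Ashwin's point below that S3 imposes a minimum part size of 5 MB for every part except the last.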
Re: S3 Output Module
Regards,
Chaitanya

On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <chaita...@datatorrent.com> wrote:
> @Tushar
>
> The S3 Copy Output Module consists of the following operators:
> 1) BlockWriter: Writes the blocks into HDFS.
> 2) Synchronizer: Sends a trigger to the downstream operator when all the blocks of a file are written to HDFS.
> 3) FileMerger: Merges all the blocks into a file and uploads the merged file into the S3 bucket.
>
> @Ashwin
>
> Good suggestion. In the first iteration, I will add the proposed design. Multipart support will be added in the next iteration.
>
> Regards,
> Chaitanya
>
> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <ashwinchand...@gmail.com> wrote:
>> +1 regarding the S3 upload functionality.
>>
>> However, I think we should just focus on multipart upload directly, as it comes with various advantages: higher throughput, faster recovery, and not needing to wait for the entire file to be created before uploading each part.
>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
>>
>> Also, it seems we can do a multipart upload if the file size is more than 5 MB. They recommend using multipart if the file size is more than 100 MB. I am not sure if there is a hard lower limit, though.
>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
>>
>> This way, it seems we don't have to wait until a file is completely written to HDFS before performing the upload operation.
>>
>> Regards,
>> Ashwin.
>>
>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <tus...@datatorrent.com> wrote:
>>> +1, we need this functionality.
>>>
>>> Is it going to be a single operator or multiple operators? If multiple operators, can you explain what functionality each operator will provide?
>>>
>>> Regards,
>>> -Tushar.
>>>
>>> On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <yogideven...@apache.org> wrote:
>>>> Writing to S3 is a common use-case for applications.
>>>> This module will be definitely helpful.
>>>>
>>>> +1 for adding this module.
>>>>
>>>> ~ Yogi
>>>>
>>>> On 22 March 2016 at 13:52, Chaitanya Chebolu <chaita...@datatorrent.com> wrote:
>>>>> Hi All,
>>>>>
>>>>> I am proposing S3 output copy
[jira] [Updated] (APEXMALHAR-2022) S3 Output Module for file copy
[ https://issues.apache.org/jira/browse/APEXMALHAR-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chaitanya updated APEXMALHAR-2022:
----------------------------------
Assignee: Hitesh Kapoor  (was: Chaitanya)

> S3 Output Module for file copy
> ------------------------------
>
> Key: APEXMALHAR-2022
> URL: https://issues.apache.org/jira/browse/APEXMALHAR-2022
> Project: Apache Apex Malhar
> Issue Type: Task
> Reporter: Chaitanya
> Assignee: Hitesh Kapoor
>
> The primary functionality of this module is to copy files into an S3 bucket using a block-by-block approach.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)