[ https://issues.apache.org/jira/browse/BEAM-5959?focusedWorklogId=176211&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-176211 ]
ASF GitHub Bot logged work on BEAM-5959:
----------------------------------------
Author: ASF GitHub Bot
Created on: 17/Dec/18 19:32
Start Date: 17/Dec/18 19:32
Worklog Time Spent: 10m
Work Description: udim commented on a change in pull request #7266:
[BEAM-5959] Add performance testing for writing many files
URL: https://github.com/apache/beam/pull/7266#discussion_r242285792
##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java
##########
@@ -758,7 +758,10 @@ final void moveToOutputFiles(
}
// During a failure case, files may have been deleted in an earlier step. Thus
// we ignore missing files here.
+ long startTime = System.nanoTime();
FileSystems.rename(srcFiles, dstFiles, StandardMoveOptions.IGNORE_MISSING_FILES);
+ long endTime = System.nanoTime();
+ LOG.info("Renamed {} files in {} seconds.", srcFiles.size(), (endTime - startTime) / 1e9);
Review comment:
I've done some investigation on Friday.
1. A correction: the log doesn't appear in the terminal - you have to go to the
Stackdriver logs. Having a history of values is important for detecting and
investigating regressions, and Stackdriver logs are not kept around for long.
Regardless, I created a bug to log Gradle output in PerfKit tests:
https://issues.apache.org/jira/browse/BEAM-6251
2. It is possible to report metrics from the workers and query them from the
code that launches the pipeline, such as TextIOIT.java (rough sketch below).
These metrics could be logged and then picked up by
https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/blob/master/perfkitbenchmarker/linux_benchmarks/beam_integration_benchmark.py
using `regex_util.ExtractAllMatches` and added to the benchmark results (using
`sample.Sample`,
https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/blob/2fc792d137181f2b273fdb7842a3cf513b273e4d/perfkitbenchmarker/linux_benchmarks/beam_integration_benchmark.py#L163).
Parsing log files with regexes feels hacky, though. I prefer your solution of
writing metrics to BQ, and I'd rather write all results (incl. `run_time`) to
BQ from TextIOIT and ignore PerfKit's reporting. WDYT?
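
To make the metrics idea concrete, here is a rough sketch (not the PR's code) of
reporting the rename duration via Beam's Metrics API on the worker and querying
it from the launching test; the metric name `rename_time_ms` and the helper
methods are made up:

```java
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.DistributionResult;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;

class RenameMetricsSketch {
  // Worker side: must be updated from code running in a DoFn context
  // (e.g. the WriteFiles finalization step) so the runner reports it.
  private static final Distribution RENAME_TIME_MS =
      Metrics.distribution("FileBasedSink", "rename_time_ms");

  static void recordRename(long startNanos, long endNanos) {
    RENAME_TIME_MS.update((endNanos - startNanos) / 1_000_000);
  }

  // Launcher side (e.g. TextIOIT, after the pipeline finishes):
  static void printRenameTime(PipelineResult result) {
    MetricQueryResults metrics =
        result
            .metrics()
            .queryMetrics(
                MetricsFilter.builder()
                    .addNameFilter(MetricNameFilter.named("FileBasedSink", "rename_time_ms"))
                    .build());
    for (MetricResult<DistributionResult> dist : metrics.getDistributions()) {
      // This value could be logged for PerfKit to scrape, or written to BQ directly.
      // getAttempted() may be needed instead, depending on the runner.
      System.out.println("rename_time_ms max: " + dist.getCommitted().getMax());
    }
  }
}
```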
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 176211)
Time Spent: 7h 20m (was: 7h 10m)
> Add Cloud KMS support to GCS copies
> -----------------------------------
>
> Key: BEAM-5959
> URL: https://issues.apache.org/jira/browse/BEAM-5959
> Project: Beam
> Issue Type: Bug
> Components: io-java-gcp, sdk-py-core
> Reporter: Udi Meiri
> Assignee: Udi Meiri
> Priority: Major
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> Beam SDK currently uses the CopyTo GCS API call, which doesn't support
> copying objects that use Customer-Managed Encryption Keys (CMEK).
> CMEKs are managed in Cloud KMS.
> Items (for Java and Python SDKs):
> - Update clients to versions that support KMS keys.
> - Change copyTo API calls to use rewriteTo (Python: directly; Java: possibly
> convert the copyTo API call to use the client library). A rough sketch of the
> rewrite loop follows this list.
> - Add unit tests.
> - Add basic tests (DirectRunner and GCS buckets with CMEK).
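>
> Illustrative sketch only (not the actual Beam change; the helper name below is
> made up), assuming the generated GCS JSON API client
> (com.google.api.services.storage): unlike objects.copy, objects.rewrite may
> return before the copy is complete and has to be resumed with a rewrite token.
> {code:java}
> import com.google.api.services.storage.Storage;
> import com.google.api.services.storage.model.RewriteResponse;
> import com.google.api.services.storage.model.StorageObject;
> import java.io.IOException;
>
> class GcsRewriteSketch {
>   // Copy src -> dst using objects.rewrite, looping until the rewrite is done.
>   static void rewriteCopy(
>       Storage storage, String srcBucket, String srcObject, String dstBucket, String dstObject)
>       throws IOException {
>     Storage.Objects.Rewrite rewrite =
>         storage.objects().rewrite(srcBucket, srcObject, dstBucket, dstObject, new StorageObject());
>     RewriteResponse response = rewrite.execute();
>     while (!response.getDone()) {
>       // Large objects may require several rewrite calls, resumed via the token.
>       rewrite.setRewriteToken(response.getRewriteToken());
>       response = rewrite.execute();
>     }
>   }
> }
> {code}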
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)