[
https://issues.apache.org/jira/browse/KUDU-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063080#comment-17063080
]
Todd Lipcon commented on KUDU-2432:
-----------------------------------
I looked into this a bit tonight since it has been happening a lot lately. I
SSHed into one of the slaves that had hit a failure, ran 'docker logs' on the
dist-test slave container to get the full logs, and grabbed the portion
corresponding to a failed job. It looks like the first attempt to download the
files for the task failed with a "connection reset by peer" error. The retries
then seem to fail because the directory already exists from the first attempt.
In other words, it's not a race, just broken retry logic. I will look at the
code next.
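
For illustration, the failure mode described above (a retry tripping over a
directory left behind by the first attempt) goes away if directory creation is
idempotent. This is a minimal sketch only: `ensure_directories` is a
hypothetical helper, not the swarming client's create_directories and not
necessarily how the bug will be fixed. It assumes the list of directories
includes every parent directory.

```python
import errno
import os


def ensure_directories(base_directory, dirs):
    """Create each directory under base_directory, tolerating leftovers.

    Hypothetical helper for illustration; assumes `dirs` contains relative
    paths and includes every parent directory that needs to exist.
    """
    # Sorting puts a parent such as "build" before "build/latest".
    for d in sorted(dirs):
        path = os.path.join(base_directory, d)
        try:
            os.mkdir(path)
        except OSError as e:
            # A directory left over from an earlier failed attempt is fine.
            # Anything else (permissions, a regular file in the way) is
            # still treated as an error.
            if e.errno != errno.EEXIST or not os.path.isdir(path):
                raise
```

Calling it twice with the same arguments succeeds both times, which is the
property a retried download needs. An alternative with the same effect would be
to have each retry start from a fresh or emptied task directory.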
> isolate race creating directory via dist_test.py
> ------------------------------------------------
>
> Key: KUDU-2432
> URL: https://issues.apache.org/jira/browse/KUDU-2432
> Project: Kudu
> Issue Type: Bug
> Components: test
> Reporter: Mike Percy
> Priority: Major
> Attachments: logs.txt
>
>
> When running dist_test.py I have been getting a 1% failure rate due to the
> following errors.
> I am not sure if this is new or related to a single bad machine.
> {code:java}
> failed to download task files: WARNING 123 isolateserver(1484): Adding
> unknown file 7cf0792d18a9dbef867c9bce0c681b3def0510b6 to cache
> WARNING 126 isolateserver(1490): Added back 1 unknown files
> INFO 135 tools(106): Profiling: Section Setup took 0.045 seconds
> INFO 164 tools(106): Profiling: Section GetIsolateds took 0.029 seconds
> INFO 167 tools(106): Profiling: Section GetRest took 0.003 seconds
> INFO 175 isolateserver(1365): 1 ( 227022kb) added
> INFO 176 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 176 isolateserver(1372): 0 ( 0kb) removed
> INFO 176 isolateserver(1375): 45627408kb free
> INFO 176 tools(106): Profiling: Section CleanupTrimming took 0.009 seconds
> INFO 177 isolateserver(1365): 1 ( 227022kb) added
> INFO 177 isolateserver(1369): 1642 ( 3864634kb) current
> INFO 177 isolateserver(1372): 0 ( 0kb) removed
> INFO 177 isolateserver(1375): 45627408kb free
> INFO 178 tools(106): Profiling: Section CleanupTrimming took 0.001 seconds
> INFO 178 isolateserver(381): Waiting for all threads to die...
> INFO 178 isolateserver(390): Done.
> Traceback (most recent call last):
>   File "/swarming.client/isolateserver.py", line 2211, in <module>
>     sys.exit(main(sys.argv[1:]))
>   File "/swarming.client/isolateserver.py", line 2204, in main
>     return dispatcher.execute(OptionParserIsolateServer(), args)
>   File "/swarming.client/third_party/depot_tools/subcommand.py", line 242, in execute
>     return command(parser, args[1:])
>   File "/swarming.client/isolateserver.py", line 2064, in CMDdownload
>     require_command=False)
>   File "/swarming.client/isolateserver.py", line 1827, in fetch_isolated
>     create_directories(outdir, bundle.files)
>   File "/swarming.client/isolateserver.py", line 212, in create_directories
>     os.mkdir(os.path.join(base_directory, d))
> OSError: [Errno 17] File exists: '/tmp/dist-test-task_gm4pM/build'
> {code}