[ 
https://issues.apache.org/jira/browse/PIG-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156162#comment-16156162
 ] 

Rohini Palaniswamy commented on PIG-5290:
-----------------------------------------

Few comments:
  1) Can you compile the Pattern and reuse it instead of calling matches every 
time? That will be more cost effective.
  2) If there is still a collision during writing temp file due to random 
number collision (we have run into the rare scenario where different jobs had 
random numbers collide during temp dir creation) or renaming to final file name 
(more likely than first scenario) you will still end up with an error. First 
case is really rare and so we can skip for now. But we should handle the rename 
problem by checking for FileAlreadyExistsException in a separate try/catch 
block and verify there if the existing file size is same as the temp file and 
if not thrown an error.
  3) Regarding reverting to the listStatus behavior, I checked PIG-3815-3.patch 
and before the change looks like it still did check to see if filename matched.
https://github.com/apache/pig/blob/c2aedcc66486ddc721a32dc4984547f049aa5541/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java#L1638
     I don't think we should be doing the listStatus check you are doing and 
pick any non-tmp file in that directory. Might end up with wrong file in case 
of checksum collisions. Comment 1) would be not necessary if listStatus 
behavior is reverted.
  4) We generally expect unit tests to be added with every patch. But in this 
case simulating a race condition is hard, so I would say we can skip and rely 
on the older tests. But on checking realized that there is no actually no test 
for this feature. Can you add -Dpig.user.cache.enabled=true to java_params of 
UdfDistributedCache and MonitoredUDF tests in nightly.conf?

> User Cache upload contention can cause job failures
> ---------------------------------------------------
>
>                 Key: PIG-5290
>                 URL: https://issues.apache.org/jira/browse/PIG-5290
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.13.0
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>         Attachments: PIG-5290.patch
>
>
> We recently enabled the User Cache (PIG-2672) feature and found that 
> occasionally jobs would fail because of contention when uploading JARs into 
> the cache. Although the cache is designed to be fail-safe, i.e. to fall back 
> to normal behavior if anything goes wrong by catching all {{IOException}}, 
> the portion of code which closes the output stream _is not_ wrapped within a 
> {{try}} statement and thus an exception during the closing of that stream 
> causes the entire job to fail. If multiple jobs are attempting to upload the 
> same JAR failure simultaneously, the contention can cause this close 
> statement to fail.
> The current strategy also has two other flaws. First, consider the scenario 
> where job A begins uploading jar X. Job B also needs jar X, sees that the 
> file exists, and launches its tasks. Yet, job A has not yet finished 
> uploading jar X (perhaps it is large). So, the tasks are localizing a 
> half-completed version of jar X. Second, the original design allowed for the 
> same JAR (identical contents) to be shared between jobs even if a different 
> name was used. In PIG-3815, however, this ability was removed, and now JARs 
> are only shared if they have the same name.
> I propose we solve all of these issues simultaneously by returning to the 
> listStatus based behavior (used prior to PIG-3815), but filter out entries 
> ending in {{.tmp}}. When uploading, upload to {{randomNumber.tmp}}, then once 
> the file is completed, do a rename to the original name of the JAR file. This 
> ensures that incomplete files are never in a location that would be accessed 
> by other jobs, and the only write operation accessing a shared path is a 
> single rename operation.
> An alternative design is to use a single canonicalized name for all JAR files 
> (they will still be unique since they are inside of directories based on 
> their SHA1). Upload to a tmp file as previously described, then rename to the 
> canonical name. This removes the need to do a listStatus call; however it 
> will result in classpaths that are human unreadable since the name of the JAR 
> file has been lost. I think it's worth it from a debugging standpoint to go 
> with the first design.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to