[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (37 issues)
Subscriber: pigdaily

Key       Summary
PIG-5373  InterRecordReader might skip records if certain sync markers are used
          https://issues.apache.org/jira/browse/PIG-5373
PIG-5369  Add llap-client dependency
          https://issues.apache.org/jira/browse/PIG-5369
PIG-5360  Pig sets working directory of input file systems causes exception thrown
          https://issues.apache.org/jira/browse/PIG-5360
PIG-5338  Prevent deep copy of DataBag into Jython List
          https://issues.apache.org/jira/browse/PIG-5338
PIG-5323  Implement LastInputStreamingOptimizer in Tez
          https://issues.apache.org/jira/browse/PIG-5323
PIG-5273  _SUCCESS file should be created at the end of the job
          https://issues.apache.org/jira/browse/PIG-5273
PIG-5267  Review of org.apache.pig.impl.io.BufferedPositionedInputStream
          https://issues.apache.org/jira/browse/PIG-5267
PIG-5256  Bytecode generation for POFilter and POForeach
          https://issues.apache.org/jira/browse/PIG-5256
PIG-5160  SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
          https://issues.apache.org/jira/browse/PIG-5160
PIG-5115  Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
          https://issues.apache.org/jira/browse/PIG-5115
PIG-5106  Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
          https://issues.apache.org/jira/browse/PIG-5106
PIG-5081  Can not run pig on spark source code distribution
          https://issues.apache.org/jira/browse/PIG-5081
PIG-5080  Support store alias as spark table
          https://issues.apache.org/jira/browse/PIG-5080
PIG-5057  IndexOutOfBoundsException when pig reducer processOnePackageOutput
          https://issues.apache.org/jira/browse/PIG-5057
PIG-5029  Optimize sort case when data is skewed
          https://issues.apache.org/jira/browse/PIG-5029
PIG-4926  Modify the content of start.xml for spark mode
          https://issues.apache.org/jira/browse/PIG-4926
PIG-4913  Reduce jython function initiation during compilation
          https://issues.apache.org/jira/browse/PIG-4913
PIG-4849  pig on tez will cause tez-ui to crash, because the content from timeline server is too long
          https://issues.apache.org/jira/browse/PIG-4849
PIG-4750  REPLACE_MULTI should compile Pattern once and reuse it
          https://issues.apache.org/jira/browse/PIG-4750
PIG-4684  Exception should be changed to warning when job diagnostics cannot be fetched
          https://issues.apache.org/jira/browse/PIG-4684
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
          https://issues.apache.org/jira/browse/PIG-4656
PIG-4598  Allow user defined plan optimizer rules
          https://issues.apache.org/jira/browse/PIG-4598
PIG-4551  Partition filter is not pushed down in case of SPLIT
          https://issues.apache.org/jira/browse/PIG-4551
PIG-4539  New PigUnit
          https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
          https://issues.apache.org/jira/browse/PIG-4515
PIG-4373  Implement PIG-3861 in Tez
          https://issues.apache.org/jira/browse/PIG-4373
PIG-4323  PackageConverter hanging in Spark
          https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
          https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
          https://issues.apache.org/jira/browse/PIG-4251
PIG-4002  Disable combiner when map-side aggregation is used
          https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
          https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
          https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
          https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
          https://issues.apache.org/jira/browse/PIG-3873
PIG-3668  COR built-in function when at least one of the coefficient values is NaN
          https://issues.apache.org/jira/browse/PIG-3668
PIG-3587  add functionality for rolling over dates
          https://issues.apache.org/jira/browse/PIG-3587
PIG-1804  Allow Jython function to implement Algebraic and/or Accumulator interfaces
          https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/EditSubscription!default.jspa?subId=16328=12322384
[jira] [Commented] (PIG-5372) SAMPLE/RANDOM(udf) before skewed join failing with NPE
[ https://issues.apache.org/jira/browse/PIG-5372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732373#comment-16732373 ]

Daniel Dai commented on PIG-5372:
---------------------------------

Wow, that's back in 2010 :). I think SkewedPartitioner.setConf is passing conf to MapRedUtil.loadPartitionFileFromLocalCache via PigMapReduce.sJobConf. This is no longer necessary, as MapRedUtil.loadPartitionFileFromLocalCache takes a mapConf parameter (added in a later patch). We can change MapRedUtil.loadPartitionFileFromLocalCache to retrieve fs.file.impl/fs.hdfs.impl from mapConf; then we don't need to overwrite PigMapReduce.sJobConf in SkewedPartitioner.setConf.

> SAMPLE/RANDOM(udf) before skewed join failing with NPE
> ------------------------------------------------------
>
>                 Key: PIG-5372
>                 URL: https://issues.apache.org/jira/browse/PIG-5372
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.16.0
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Major
>         Attachments: pig-5372-v1.patch
>
> A sample short script like the one below:
> {code}
> A = LOAD 'input.txt' AS (a1:int, a2:chararray, a3:int);
> B = LOAD 'input.txt' AS (b1:int, b2:chararray, b3:int);
> A2 = FOREACH A generate *, RANDOM() as randnum;
> D = join A2 by a1, B by b1 using 'skewed' parallel 2;
> store D into '$output';
> {code}
> fails with an NPE.
> {noformat}
> 2018-12-12 16:06:04,860 [Dispatcher thread: Central] INFO org.apache.tez.dag.history.HistoryEventHandler - [HISTORY][DAG:dag_1544648742542_0001_1][Event:TASK_FINISHED]: vertexName=scope-55, taskId=task_1544648742542_0001_1_02_00, startTime=1544648745036, finishTime=1544648764857, timeTaken=19821, status=KILLED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, info=[Error: Failure while running task:org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: Local Rearrange[tuple]{int}(false) - scope-29 -> scope-58 Operator Key: scope-29): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-40 Operator Key: scope-40) children: null at []]: java.lang.NullPointerException
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:315)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
>     at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POLocalRearrangeTez.getNextTuple(POLocalRearrangeTez.java:131)
>     at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:420)
>     at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:282)
>     at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:337)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
>     at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
>     at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(org.apache.pig.builtin.RANDOM)[double] - scope-40 Operator Key: scope-40) children: null at []]: java.lang.NullPointerException
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:367)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:408)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:325)
>     at
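Daniel Dai's suggestion above amounts to a general refactoring pattern: thread configuration through as an explicit parameter instead of publishing it through a mutable static field (the role PigMapReduce.sJobConf plays here), which is exactly the kind of hidden channel that yields an NPE when a code path never initialized it. A minimal sketch of that pattern, not Pig's actual code, with a plain Map standing in for a Hadoop Configuration and hypothetical method names:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfPassingSketch {
    // The fragile pattern: a mutable static acting as a hidden channel.
    static Map<String, String> sJobConf;

    // Before: reads the static; throws NullPointerException if nothing
    // on this execution route ever assigned it.
    static String loadViaStatic(String key) {
        return sJobConf.get(key);
    }

    // After: the caller hands over the conf it already holds, so there
    // is no hidden initialization-order dependency.
    static String loadViaParam(Map<String, String> mapConf, String key) {
        return mapConf.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> mapConf = new HashMap<>();
        mapConf.put("fs.file.impl", "org.apache.hadoop.fs.LocalFileSystem");
        System.out.println(loadViaParam(mapConf, "fs.file.impl"));
    }
}
```

The fix discussed in the comment follows the "after" shape: loadPartitionFileFromLocalCache reads fs.file.impl/fs.hdfs.impl from the mapConf it is given rather than from the static.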
[jira] [Commented] (PIG-5373) InterRecordReader might skip records if certain sync markers are used
[ https://issues.apache.org/jira/browse/PIG-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732007#comment-16732007 ]

Nandor Kollar commented on PIG-5373:
------------------------------------

I have one observation on the patch: to be future-proof, instead of CircularFifoBuffer from commons-collections I think we should use CircularFifoQueue from commons-collections4. On one hand, CircularFifoBuffer was removed from the latest commons-collections code; on the other hand, CircularFifoQueue is generic, so we can eliminate iterating through Object items and casting to Integer. Be aware of one thing: the semantics of isFull have changed, since a CircularFifoQueue is never full. The isFull call should be replaced with {{queue.size() == queue.maxSize()}}.

> InterRecordReader might skip records if certain sync markers are used
> ---------------------------------------------------------------------
>
>                 Key: PIG-5373
>                 URL: https://issues.apache.org/jira/browse/PIG-5373
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: PIG-5373.0.patch
>
> Due to a bug in InterRecordReader#skipUntilMarkerOrSplitEndOrEOF(), it can happen that sync markers are not identified while reading the interim binary file used to hold data between jobs.
> In such files, sync markers are placed upon writing, and they later help during reading the data. The markers are randomly generated, and it seems that in some rare combinations of a marker and the data preceding it, the marker cannot be found. This can result in reading through all the bytes (looking for the marker) and reaching the split end or EOF while extracting no records at all.
> This symptom is also observable from JobHistory stats: tasks of a job affected by this issue will have HDFS_BYTES_READ or FILE_BYTES_READ about equal to the number of bytes in the split, but at the same time have MAP_INPUT_RECORDS=0.
> One such (test) example is this:
> {code:java}
> marker: [-128, -128, 4] , data: [127, -1, 2, -128, -128, -128, 4, 1, 2, 3]{code}
> Due to the bug, markers whose prefix overlaps with the last data chunk are not seen by the reader.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
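The (test) example in the issue description can be reproduced with a short sketch. The code below is not Pig's actual InterRecordReader; it only illustrates the failure mode: a naive scan that resets its match state on a mismatch, without revisiting already-consumed bytes, misses a marker whose prefix overlaps the preceding data, while a scan that keeps a sliding window of the last marker-length bytes (the job the circular buffer does in the patch) finds it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class MarkerScanSketch {
    // Buggy variant: on a mismatch it resets the partial match but never
    // reconsiders the bytes consumed during it, so the marker starting at
    // offset 4 of the example data is skipped.
    static int naiveScan(byte[] data, byte[] marker) {
        int matched = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == marker[matched]) {
                matched++;
                if (matched == marker.length) {
                    return i - marker.length + 1; // offset where the marker starts
                }
            } else {
                matched = 0; // bug: data[i] itself is never retried as a marker start
            }
        }
        return -1;
    }

    // Fixed variant: keep a sliding window of the last marker.length bytes
    // and compare the whole window at every position, so an overlapping
    // prefix cannot hide the marker.
    static int windowScan(byte[] data, byte[] marker) {
        Deque<Byte> window = new ArrayDeque<>();
        for (int i = 0; i < data.length; i++) {
            if (window.size() == marker.length) {
                window.removeFirst(); // evict the oldest byte, circular-buffer style
            }
            window.addLast(data[i]);
            if (window.size() == marker.length) {
                int j = 0;
                boolean match = true;
                for (byte b : window) {
                    if (b != marker[j++]) { match = false; break; }
                }
                if (match) {
                    return i - marker.length + 1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] marker = {-128, -128, 4};
        byte[] data = {127, -1, 2, -128, -128, -128, 4, 1, 2, 3};
        System.out.println(naiveScan(data, marker));  // -1: the marker is missed
        System.out.println(windowScan(data, marker)); // 4: marker found at offset 4
    }
}
```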
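Nandor Kollar's note about isFull is the subtle part of the commons-collections4 migration: CircularFifoQueue accepts add() even at capacity by evicting the oldest element, so it never reports itself full, and fullness must be tested as size() == maxSize(). A minimal stand-in class (not the real commons-collections4 type) showing that semantic:

```java
import java.util.ArrayDeque;

// Simplified stand-in for commons-collections4 CircularFifoQueue, only to
// illustrate the eviction/fullness semantics discussed in the comment.
public class CircularFifoSketch<E> {
    private final ArrayDeque<E> deque = new ArrayDeque<>();
    private final int maxSize;

    public CircularFifoSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    public int maxSize() { return maxSize; }
    public int size()    { return deque.size(); }

    // add() never fails: at capacity it silently evicts the oldest element,
    // which is why an isFull()-style check becomes size() == maxSize().
    public void add(E e) {
        if (deque.size() == maxSize) {
            deque.removeFirst();
        }
        deque.addLast(e);
    }

    public static void main(String[] args) {
        CircularFifoSketch<Integer> q = new CircularFifoSketch<>(3);
        for (int i = 1; i <= 5; i++) {
            q.add(i); // adding 4 and 5 evicts 1 and 2
        }
        System.out.println(q.size() == q.maxSize()); // true: replaces isFull()
        System.out.println(q.deque);                 // [3, 4, 5]
    }
}
```

Because the class is generic, a reader holding a CircularFifoSketch<Integer> needs no Object-to-Integer casts, which is the second benefit the comment mentions.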