shardulm94 opened a new pull request #1582:
URL: https://github.com/apache/iceberg/pull/1582


   `InputSplit.getLocations()` can be called on mappers in some cases. Since 
both `locations` and `conf` are transient, the current code produces an NPE as 
`conf` is null on the mappers.
   
   The value of `locations` is not really relevant on the mappers since the 
tasks have already been distributed. So here we just return `ANYWHERE` when the 
`conf` is null. We can probably be more accurate by serializing the `locations` 
values if set, but I don't think its worth it.
   
   ```
   java.lang.Exception: java.lang.NullPointerException
       at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
       at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
   Caused by: java.lang.NullPointerException
       at 
org.apache.iceberg.mr.mapreduce.IcebergSplit.getLocations(IcebergSplit.java:69)
       .
       .
       .
       at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit.toString(PigSplit.java:465)
       at java.lang.String.valueOf(String.java:2994)
       at java.lang.StringBuilder.append(StringBuilder.java:131)
       at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
       at 
org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
       at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
   ```
   
   I discovered this issue while testing our internal Pig `LoadFunc` 
implementation which works with multiple `InputFormat`s. Iceberg's `LoadFunc` 
implementation in `iceberg-pig` does not have this issue as it contains its own 
`InputSplit` implementation which does not provide location information.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to