rdsr commented on pull request #1267:
URL: https://github.com/apache/iceberg/pull/1267#issuecomment-666208956
> I run MR in local mode whereas this seems to be running in distributed
mode with YARN. I'll have to dig deeper but my guess is that `table` is null
because in that case the calls to `getSplits` and then `getRecordReader`
happens in two different processes.
> @rdsr, what do you think of this approach? One downside is the increase in
size of the serialized splits.
Hi @guilload, @massdosage. I was trying out an alternative way of passing
the required parameters. Instead of setting `TABLE_PATH`, `SCHEMA`,
etc. in the `InputFormat#getSplits` method, where they are not propagated to
the record readers on worker nodes, I tried setting the required parameters in
`org.apache.hadoop.hive.ql.metadata.HiveStorageHandler#configureInputJobProperties`.
From that method's javadoc:
> /**
>  * This method is called to allow the StorageHandlers the chance
>  * to populate the JobContext.getConfiguration() with properties that
>  * maybe be needed by the handler's bundled artifacts (ie InputFormat,
>  * SerDe, etc).
>  */
it looks like it may be the right method to do what we are trying to
achieve. Below are my modifications:
https://github.com/apache/iceberg/compare/master...rdsr:alternative_conf?expand=1
I've yet to try this out on a real cluster to confirm that it works in
distributed YARN mode, though. I plan to do that tomorrow.
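To make the idea concrete, here is a minimal, self-contained sketch of the approach. It stands in for a `HiveStorageHandler#configureInputJobProperties` override: Hive invokes that method on the client before job submission, and anything placed into the supplied job-properties map is merged into the job `Configuration`, which YARN ships to the worker processes. The class name, property keys (`iceberg.mr.table.location`, `iceberg.mr.table.schema`), and the simplified signature (a `Properties` in place of Hive's `TableDesc`) are all hypothetical illustrations, not the actual constants or types from the PR.

```java
import java.util.Map;
import java.util.Properties;

// Hedged sketch of the configureInputJobProperties approach.
// Hive's real signature is configureInputJobProperties(TableDesc, Map<String, String>);
// a plain Properties object stands in for TableDesc here to keep the sketch runnable.
public class IcebergStorageHandlerSketch {

    // Hypothetical property keys; the real PR defines its own constants.
    static final String TABLE_LOCATION = "iceberg.mr.table.location";
    static final String TABLE_SCHEMA = "iceberg.mr.table.schema";

    // Called on the client before job submission. Entries added to jobProperties
    // are copied into the job Configuration, so they are visible both to
    // InputFormat#getSplits on the client and to getRecordReader on the workers,
    // even though those run in different processes under YARN.
    public static void configureInputJobProperties(Properties tableProps,
                                                   Map<String, String> jobProperties) {
        jobProperties.put(TABLE_LOCATION, tableProps.getProperty("location"));
        jobProperties.put(TABLE_SCHEMA, tableProps.getProperty("schema"));
    }
}
```

The design point is that the serialized job configuration, not the split objects, carries the table metadata, which avoids the split-size increase mentioned above.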