GitHub user redbaron commented on the pull request:
https://github.com/apache/spark/pull/5147#issuecomment-85390241
Thanks for taking a look at this PR.
First, about `printPipeContext`: using this function is really awkward,
because that is not how data processors are normally written. Usually they suck
data from stdin, spit it out to stdout, and have their behaviour controlled with
command line options and/or env variables. `printPipeContext` prepends non-data
information to the data in a single stream. This adds unnecessary complication:
either the data processor must be wrapped in a shell script, which consumes just
the right amount of input, figures out the runtime switches for the data
processor and then spawns it, or the data processor itself must be (re-)written
to be aware of this feature. Compare that with the `pipeWith*` functions, where
it all comes naturally in the form of a built-up command or env vars map.
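
To make the contrast concrete, here is a minimal sketch against the existing
`RDD.pipe` API. It assumes a local Spark setup and a POSIX `sh` and `cat` on the
executors; the object name, the `#mode=fast` header and the `MODE` variable are
made up for illustration and are not part of this PR.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PipeContextSketch {
  // printPipeContext writes into the child's stdin ahead of the records,
  // so a plain filter such as `cat` sees the header as just another line.
  def withHeader(rdd: RDD[String]): RDD[String] = {
    val header: (String => Unit) => Unit = out => out("#mode=fast")
    rdd.pipe(command = Seq("cat"), printPipeContext = header)
  }

  // Settings passed through the environment stay out of the data stream,
  // but this map is fixed up front and identical for every partition.
  def withEnv(rdd: RDD[String]): RDD[String] =
    rdd.pipe(Seq("sh", "-c", "test \"$MODE\" = fast && cat"), Map("MODE" -> "fast"))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("pipe-context-sketch"))
    try {
      val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)
      withHeader(rdd).collect().foreach(println) // header lines mixed into the data
      withEnv(rdd).collect().foreach(println)    // data only
    } finally {
      sc.stop()
    }
  }
}
```

In the first variant the processor has to know that the first line of each
partition is not data. In the second it needs no changes, but the existing API
offers no way to vary the command or environment per partition, which is the gap
the `pipeWith*` functions are meant to fill.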
The next thing is the overall deprecation of the `withPartition` commands. I
noticed this trend and can't say it makes me happy. It rests on the assumption
that the only useful bit of a Partition is its index, but that is not always the
case. Take, for example, the Spark Cassandra driver: it has its own partitioner,
which inherits from Partition and contains rich information on the current
subset of data, like the range of vnodes being processed. The same can be true
of any in-house partitioners, which not only split the data but are also used to
deliver metadata about each split right down to the executors.
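
As an illustration of a Partition that carries more than its index, here is a
minimal sketch. `TokenRangePartition` and `TokenRangeRDD` are hypothetical names
invented for this example; they are not the Cassandra driver's actual classes,
and the fields are only a guess at the kind of metadata such a partition holds.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: besides the index it records which token
// range the split covers and which host is expected to hold it.
class TokenRangePartition(
    val index: Int,
    val startToken: Long,
    val endToken: Long,
    val preferredHost: String) extends Partition

// Hypothetical source RDD that emits one descriptive line per partition,
// built from the metadata stored in the partition object itself.
class TokenRangeRDD(sc: SparkContext, ranges: Seq[(Long, Long, String)])
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    ranges.zipWithIndex.map { case ((start, end, host), i) =>
      new TokenRangePartition(i, start, end, host): Partition
    }.toArray

  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    // Inside compute the full partition object is available, not just its id.
    val p = split.asInstanceOf[TokenRangePartition]
    Iterator(s"partition ${p.index}: tokens [${p.startToken}, ${p.endToken}) on ${p.preferredHost}")
  }
}
```

The metadata is visible inside `compute`, but a downstream transformation that is
only handed the partition index has no way to reach `startToken` or
`preferredHost`.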
In short, what I am trying to achieve is a way to access the Partition
instance while processing a given partition. TaskContext provides just
partitionId, and unless there is a way to get the partition instance using that
id, it doesn't solve my problem. I briefly checked the code, and it seems that
in the places where `TaskContextImpl` is instantiated no Partition instances are
available, so it would require some plumbing just to deliver them there.
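
For reference, the kind of access I mean can be sketched with a small wrapper
RDD, because `compute` is the one place where the Partition object is handed
over. `WithPartitionRDD` is a hypothetical name for this example and is not part
of Spark or of this PR; note also that the user function is not closure-cleaned
here, so it has to be serializable as written.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: passes the parent's Partition object, not only its
// index, to the user function together with the partition's records.
class WithPartitionRDD[T: ClassTag, U: ClassTag](
    prev: RDD[T],
    f: (Partition, Iterator[T]) => Iterator[U]) extends RDD[U](prev) {

  // Reuse the parent's partitions unchanged, so a custom Partition subclass
  // defined by the parent reaches `f` intact.
  override protected def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(split, firstParent[T].iterator(split, context))
}
```

Used as `new WithPartitionRDD[Int, String](rdd, (p, it) => it.map(x => s"${p.index}: $x"))`,
the function receives whatever Partition subclass the parent RDD defines and can
cast it to read the extra metadata, which `TaskContext.partitionId` alone does
not give.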