GitHub user redbaron commented on the pull request:

    https://github.com/apache/spark/pull/5147#issuecomment-85390241
  
    Thanks for taking a look into this PR.
    
    First, about `printPipeContext`: using this function is really awkward, 
because that is not how data processors are normally written. Usually they suck 
data from stdin, spit it out to stdout, and their behaviour is controlled with 
command line options and/or env variables. `printPipeContext` prepends non-data 
information to the data in a single stream. This adds unnecessary complication, 
because the data processor must either be wrapped in a shell script, which 
consumes just the right amount of input, figures out the runtime switches for 
the data processor and then spawns it, or the data processor itself must be 
(re-)written to be aware of this feature. Compare that with the `pipeWith*` 
functions, where it all comes naturally in the form of a built-up command or 
env vars map.
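
    To make the difference concrete, here is a minimal sketch that uses only 
the existing `RDD.pipe` parameters (`command`, `env`, `printPipeContext`). It 
assumes Spark is on the classpath, a local master, and a Unix `cat` binary; the 
`MODE`/`mode=fast` setting is made up for illustration. The `pipeWith*` functions 
from this PR are not shown because their signatures are not part of this comment.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipeContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("pipe-context-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

    // In-band: printPipeContext writes extra lines to the child's stdin
    // before the partition's records. `cat` knows nothing about that header,
    // so it echoes "mode=fast" back as if it were a record.
    val inBand = data.pipe(
      command = Seq("cat"),
      printPipeContext = (out: String => Unit) => out("mode=fast") // hypothetical setting
    )
    inBand.collect().foreach(println)

    // Out-of-band: the same setting is passed as an environment variable,
    // so stdin carries records only and the child needs no header parsing.
    val outOfBand = data.pipe(
      command = Seq("cat"),
      env = Map("MODE" -> "fast") // hypothetical variable name
    )
    outOfBand.collect().foreach(println)

    sc.stop()
  }
}
```

    In the first call the child process would have to strip the header itself; 
in the second it can stay an ordinary stdin-to-stdout filter.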
    
    Next is the overall deprecation of the `withPartition` commands. I noticed 
this trend and can't say it makes me happy. The deprecation assumes that the 
only useful bit of a Partition is its index, but that is not always the case. 
Take, for example, the Spark Cassandra driver: it has its own partitioner, which 
inherits Partition and contains rich information on the current subset of data, 
like the range of vnodes being processed. The same can be true of any in-house 
partitioners, which not only split the data but are also used to deliver 
metadata about each split right down to the executors.
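
    As an illustration of a partition that carries more than its index, here is 
a minimal sketch. `TokenRangePartition` and `TokenRangeRDD` are hypothetical 
names invented for this example and are not taken from the Cassandra driver; 
only `Partition`, `RDD.getPartitions` and `RDD.compute` are existing Spark API.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: besides the index, it carries the token
// range covered by its split.
class TokenRangePartition(val index: Int, val startToken: Long, val endToken: Long)
  extends Partition

// Hypothetical RDD with one partition per token range.
class TokenRangeRDD(sc: SparkContext, ranges: Seq[(Long, Long)])
  extends RDD[String](sc, Nil) {

  // Runs on the driver: builds the partitions together with their metadata.
  override protected def getPartitions: Array[Partition] =
    ranges.zipWithIndex.map { case ((start, end), i) =>
      new TokenRangePartition(i, start, end): Partition
    }.toArray

  // Runs on the executor: `split` is the Partition instance itself,
  // so the metadata travels with it.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val p = split.asInstanceOf[TokenRangePartition]
    Iterator(s"partition ${p.index} covers tokens [${p.startToken}, ${p.endToken})")
  }
}
```

    Inside `compute` the metadata is available, but a function passed to an 
index-only operation on such an RDD sees nothing of the token range.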
    
    In short, what I am trying to achieve is an option to access the 
Partition instance while processing a given partition. TaskContext provides just 
partitionId, and unless there is a way to get the partition instance using that 
id, it doesn't solve my problem. I briefly checked the code, and it seems that 
in the places where `TaskContextImpl` is instantiated no Partition instances are 
available, so it would require some plumbing just to deliver one there.
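
    For comparison, the sketch below shows one possible way to look a partition 
up by index from user code. It is an assumption-laden illustration, not something 
this PR proposes: `PartitionLookup` and `mapWithPartition` are hypothetical 
names, and the approach captures the whole partitions array in the closure, so 
every task receives metadata for all partitions.

```scala
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object PartitionLookup {
  // Hypothetical helper: passes the Partition instance to `f` by resolving
  // the partition index against an array captured on the driver.
  def mapWithPartition[T, U: ClassTag](rdd: RDD[T])(
      f: (Partition, Iterator[T]) => Iterator[U]): RDD[U] = {
    // Evaluated on the driver; Partition is Serializable, so the array can be
    // shipped with the closure (at the cost of sending all of it to each task).
    val parts: Array[Partition] = rdd.partitions
    rdd.mapPartitionsWithIndex { (idx, records) =>
      f(parts(idx), records)
    }
  }
}
```

    This only works when the index seen by the function matches the index in 
`rdd.partitions`, which is why it resolves against the same RDD it maps over 
rather than relying on `TaskContext.partitionId` alone.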


