Github user dianacarroll commented on the pull request:
https://github.com/apache/spark/pull/1276#issuecomment-48042958
@patrickwendell Before we do that...I was doing more testing on this and
found a performance impact. The call to jrdd.id() takes much longer than I
would have expected it to...on the order of 0.5-1 seconds! In my use case,
that was a big issue because I was doing an iterative process and
displaying the RDD's ID each time through the loop, and it slowed my
process down 10x. So perhaps the real fix is to figure out why _id isn't
getting set properly in some cases...or at least to check whether it's set
on each call and, if it is, return the cached value, fetching the
underlying value only if it is unset.
I will give that fix a try next week.
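
A minimal sketch of the check-then-cache idea, not the actual PySpark code.
FakeJavaRDD and CachedIdRDD are hypothetical names made up for illustration;
the fake object only counts calls and stands in for the slow jrdd.id() call:

```python
class FakeJavaRDD(object):
    """Hypothetical stand-in for the JVM-side RDD handle (not a real Py4J object)."""

    def __init__(self, rdd_id):
        self._rdd_id = rdd_id
        self.calls = 0  # how many times the "expensive" call was made

    def id(self):
        self.calls += 1
        return self._rdd_id


class CachedIdRDD(object):
    """Hypothetical wrapper that shows only the ID caching, nothing else."""

    def __init__(self, jrdd):
        self._jrdd = jrdd
        self._id = None  # unset until the ID is first requested

    def id(self):
        # Fetch the underlying value only if it has not been cached yet.
        if self._id is None:
            self._id = self._jrdd.id()
        return self._id


jrdd = FakeJavaRDD(7)
rdd = CachedIdRDD(jrdd)

# Simulate an iterative job that reads the ID on every pass.
ids = [rdd.id() for _ in range(1000)]
```

With this shape, the loop touches the underlying object once and every later
call returns the cached value, so the per-iteration cost goes away.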
On Fri, Jul 4, 2014 at 1:45 AM, Patrick Wendell <[email protected]>
wrote:
> @dianacarroll <https://github.com/dianacarroll> I think it would make
> sense to also delete the self._id field from the RDD class since it's
> never used.
>
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/1276#issuecomment-48010817>.
>