Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/12088#issuecomment-204036835
  
    @mengxr For the extra lines added, we're planning to remove 
`buildInternalScan` after finishing the migration of all `HadoopFsRelation` data 
sources, and we'll do the cleanup then. Separating Tungsten internals from 
LibSVM format parsing is a good point. And it's true that we may scan the data 
twice to compute the total number of features, since we can no longer cache the 
original RDD: it's constructed in a different way from the final `FileScanRDD`. 
Forgot to mention this in the PR title. I haven't figured out a good solution 
for this problem yet. On the other hand, the original code always caches the 
RDD in memory; does this imply we never intend to use the LibSVM data source to 
load large datasets that don't fit in memory? If that's true, we may want to 
special-case LibSVM, since the `FileScanRDD` code path may not bring much 
performance improvement to this data source.
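    The double scan follows from the LibSVM format itself: each line only 
reveals the feature indices it contains, so the feature dimension is unknown 
until every record has been read, which forces one full pass before the real 
scan. A minimal sketch in plain Scala (no Spark; `maxFeatureIndex` is a 
hypothetical helper, not code from this PR) of that inference:

    ```scala
    // Sketch: the feature dimension of LibSVM data is the maximum feature
    // index seen across ALL lines, so it cannot be known without a full pass.
    // In Spark this pass is a separate job over the raw text; without caching
    // that text, the data is read again by the actual scan.
    def maxFeatureIndex(lines: Seq[String]): Int =
      lines.foldLeft(0) { (max, line) =>
        val indices = line.trim.split("\\s+").drop(1) // drop the label
          .map(_.split(':')(0).toInt)                 // "index:value" pairs
        if (indices.isEmpty) max else math.max(max, indices.max)
      }

    // The dimension (7 here) is only known after scanning both lines.
    val numFeatures = maxFeatureIndex(Seq("1 1:0.5 3:1.2", "0 2:0.3 7:0.9"))
    ```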

