GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/18556

    [SPARK-21326][SPARK-21066][ML] Use TextFileFormat in LibSVMFileFormat and 
allow multiple input paths for determining numFeatures

    ## What changes were proposed in this pull request?
    
    This is related to 
[SPARK-19918](https://issues.apache.org/jira/browse/SPARK-19918) and 
[SPARK-18362](https://issues.apache.org/jira/browse/SPARK-18362).
    
    This PR proposes to use `TextFileFormat` and allow multiple input paths 
(but with a warning) when determining the number of features in the LibSVM data 
source via an extra scan.
    
    There are three points here:
    
    - The main advantage of this change should be to remove file-listing 
bottlenecks on the driver side.
    
    - Another advantage is that we get the benefits of `FileScanRDD`. For 
example, I guess we can use the `spark.sql.files.ignoreCorruptFiles` option when 
determining the number of features.
    
    - We can unify the schema inference code path in text-based data sources. 
This is also a preparation for 
[SPARK-21289](https://issues.apache.org/jira/browse/SPARK-21289).
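
As a rough illustration of what the extra scan has to compute, here is a simplified sketch in plain Python. It is not the code in this PR and does not use Spark. The helper names `max_feature_index` and `compute_num_features` are hypothetical, and the sketch assumes one-based, `index:value` LibSVM lines, with blank lines and lines starting with `#` skipped. The number of features is taken as the largest index seen across all lines, regardless of which input path a line came from.

```python
from typing import Iterable


def max_feature_index(line: str) -> int:
    """Return the largest one-based feature index on one LibSVM line, or 0 if none."""
    line = line.strip()
    # Skip blank lines and comment lines.
    if not line or line.startswith("#"):
        return 0
    # The first token is the label; the rest are "index:value" pairs.
    tokens = line.split()[1:]
    return max((int(tok.split(":", 1)[0]) for tok in tokens), default=0)


def compute_num_features(lines: Iterable[str]) -> int:
    """Scan every line (from one or many inputs) and return the largest index seen."""
    return max((max_feature_index(line) for line in lines), default=0)


# Lines as they might come from two separate input files (made-up data).
file_a = ["1 1:0.5 3:1.0", "0 2:2.0"]
file_b = ["1 4:1.5 6:0.1", "# comment only"]

num_features = compute_num_features(file_a + file_b)
print(num_features)
```

Because the result is a maximum over all lines, it does not matter how the lines are split across files, which is why scanning multiple input paths can still give a single, consistent feature count.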
    
    ## How was this patch tested?
    
    Unit tests in `LibSVMRelationSuite`.
    
    Closes #18288

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark libsvm-schema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18556.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18556
    


