Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/18288#discussion_r123721406
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala ---
    @@ -91,12 +91,10 @@ private[libsvm] class LibSVMFileFormat extends TextBasedFileFormat with DataSour
         val numFeatures: Int = libSVMOptions.numFeatures.getOrElse {
           // Infers number of features if the user doesn't specify (a valid) one.
           val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    -      val path = if (dataFiles.length == 1) {
    -        dataFiles.head.getPath.toUri.toString
    -      } else if (dataFiles.isEmpty) {
    +      val path = if (dataFiles.isEmpty) {
             throw new IOException("No input path specified for libsvm data")
           } else {
    -        throw new IOException("Multiple input paths are not supported for libsvm data.")
    +        dataFiles.map(_.getPath.toUri.toString).mkString(",")
    --- End diff --
    
    I see, the point is that it's necessary to scan the whole input when the number of features is being inferred automatically.
    
    I see why examining all the files could be slow, but should it be prohibited and fail completely? A warning sounds reasonable, though, if more than one file has to be examined.
    
    For example, if I have 2 input files, should this really fail?
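    
    A minimal sketch of the warn-instead-of-fail behaviour suggested above, assuming `files: Seq[FileStatus]` and a `logWarning` helper (e.g. from Spark's internal `Logging` trait) are in scope as in the diff; the message text is illustrative only:
    
    ```scala
    import java.io.IOException
    
    // Sketch only: infer numFeatures from all data files, warning rather than
    // failing when more than one file has to be scanned.
    val dataFiles = files.filterNot(_.getPath.getName startsWith "_")
    val path = if (dataFiles.isEmpty) {
      throw new IOException("No input path specified for libsvm data")
    } else {
      if (dataFiles.length > 1) {
        // Scanning several files to determine the feature count can be slow,
        // so surface that to the user instead of rejecting the input outright.
        logWarning(s"Inferring the number of features by scanning ${dataFiles.length} files; " +
          "set the 'numFeatures' option to avoid this scan.")
      }
      dataFiles.map(_.getPath.toUri.toString).mkString(",")
    }
    ```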

