Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r150367190
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/HadoopUtils.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.image
+
+import scala.language.existentials
+import scala.util.Random
+
+import org.apache.commons.io.FilenameUtils
+import org.apache.hadoop.conf.{Configuration, Configured}
+import org.apache.hadoop.fs.{Path, PathFilter}
+import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
+
+import org.apache.spark.sql.SparkSession
+
+private object RecursiveFlag {
+  /**
+   * Sets the Spark recursive flag, evaluates the given function, and then restores the flag.
+   *
+   * @param value Value to set
+   * @param spark Existing Spark session
+   * @param f The function to evaluate after setting the flag
+   * @return The result of evaluating `f`
+   */
+  def withRecursiveFlag[T](value: Boolean, spark: SparkSession)(f: => T): T = {
+    val flagName = FileInputFormat.INPUT_DIR_RECURSIVE
+    val hadoopConf = spark.sparkContext.hadoopConfiguration
+    val old = Option(hadoopConf.get(flagName))
+    hadoopConf.set(flagName, value.toString)
+    try f finally {
+      old match {
+        case Some(v) => hadoopConf.set(flagName, v)
+        case None => hadoopConf.unset(flagName)
+      }
+    }
+  }
+}
+
+/**
+ * Filter that allows loading a fraction of HDFS files.
+ */
+private class SamplePathFilter extends Configured with PathFilter {
--- End diff ---
I would prefer determinism, since that's a pretty important standard in Spark. I could imagine either (a) using a file hash with a *global* random number or (b) using random numbers if we are certain about how PathFilters work.

For (a):
* Why is there a worry about duplicate filenames? Is the full path not available?
* If you do hash filenames, then I wouldn't generate a random number for each row. (If you're generating a random number per row, then why not just use that for sampling and skip the hash?) You could generate a single random number on the driver and use that in the comparison with each hash, roughly as sketched below.
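To illustrate (a): a minimal sketch, not the PR's actual code; the class name, parameters, and hashing scheme here are made up for this example. The idea is that the accept/reject decision is a pure function of the seed and the full path, so the same set of files is chosen no matter how many filter instances exist or in what order they are consulted.

```scala
import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical sketch: accept a file iff a hash of (seed, full path) maps
// below the requested fraction. The decision depends only on the seed and
// the path, so it is reproducible across partitions, retries, and re-reads.
class DeterministicSamplePathFilter(fraction: Double, seed: Long) extends PathFilter {
  override def accept(path: Path): Boolean = {
    // Use the full path, not just the filename, so identically named files in
    // different directories are treated as distinct.
    val h = (seed.toString + path.toString).hashCode
    // Map the hash to [0, 1) and compare against the fraction.
    val u = (h & Int.MaxValue).toDouble / Int.MaxValue
    u < fraction
  }
}
```

The seed would be drawn once on the driver (e.g. `new scala.util.Random().nextLong()`) and handed to every filter instance, so no per-row random numbers are needed.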
For (b):
* If we knew how PathFilters were consumed, then we could presumably figure out a way to make this deterministic just by setting a random seed here. E.g., if a new PathFilter instance were instantiated to read each partition, then that would work. But if PathFilters are shared across reads of multiple partitions, then partition ordering could cause problems with determinism. A rough sketch of seeding through the Hadoop configuration follows below.
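For (b), a hedged sketch of what "setting a random seed here" could look like if the seed and fraction were passed through the Hadoop `Configuration`; the configuration keys and class name below are illustrative assumptions, not the PR's API. Deriving each decision from the configured seed and the path means every filter instance behaves identically regardless of how or when Hadoop constructs it.

```scala
import org.apache.hadoop.conf.Configured
import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical sketch: read a seed and a fraction from the Hadoop Configuration
// (keys are made up for illustration, and the Configuration is assumed to be
// injected by the framework) and derive each accept/reject decision from
// (seed, path) only, so instance sharing and partition ordering cannot affect
// which files are chosen.
class SeededSamplePathFilter extends Configured with PathFilter {
  override def accept(path: Path): Boolean = {
    val conf = getConf
    val fraction = conf.getDouble("sampleFraction", 1.0) // hypothetical key
    val seed = conf.getLong("sampleSeed", 0L)             // hypothetical key
    // Seed a Random from the configured seed combined with the full path, so
    // the result is a pure function of (seed, path) with no shared state.
    val rng = new scala.util.Random(seed ^ path.toString.hashCode)
    rng.nextDouble() < fraction
  }
}
```

This sidesteps the question of how many PathFilter instances Hadoop creates: since no mutable random state is shared, partition ordering cannot change the sample.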