GitHub user marmbrus commented on the pull request:
https://github.com/apache/spark/pull/4246#issuecomment-91941342
Thanks for addressing the comments. As I look at this more closely, I wonder if
we aren't approaching this the wrong way. If this is a Hadoop option that
FileInputFormat already supports, why aren't users just setting it in the Hadoop
configuration? (i.e. `sc.hadoopConfiguration.set("mapreduce.input.pathFilter.class",
...)`). As it stands, the patch sprinkles new logic throughout the codebase,
with each copy reading the same Hadoop config key on its own. That seems very
likely to lead to errors in the future. If we aren't properly propagating the
Hadoop configuration, that is a separate problem that should be fixed on its own.
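For concreteness, here is a minimal sketch of that approach (assumptions: the new mapreduce API, where `mapreduce.input.pathFilter.class` is the key that `FileInputFormat.setInputPathFilter` writes, and a hypothetical user-defined `MyPathFilter`):
```scala
import org.apache.hadoop.fs.{Path, PathFilter}

// Hypothetical user-defined filter: skip files whose names start with "_".
class MyPathFilter extends PathFilter {
  override def accept(path: Path): Boolean = !path.getName.startsWith("_")
}

// Register the filter once on the shared Hadoop configuration; every
// FileInputFormat-based read should then pick it up without new Spark code.
sc.hadoopConfiguration.setClass(
  "mapreduce.input.pathFilter.class", classOf[MyPathFilter], classOf[PathFilter])
```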
Unless there is something that I am missing, I suggest that we close this
issue and take the Hadoop configuration approach.
If you do think this is the right approach, we can discuss further, but I
have some feedback on the PR in general:
- The indentation is still off in several places. Please look at the
coding style guide.
- There is a lot of duplicated logic (some of which is wrong). It would
be better to isolate it in one place, so that if we need to fix it in the
future we can fix it once. For example, I see this pattern a lot:
```scala
val filterClassName =
  sqlContext.getConf(NewFileInputFormat.PATHFILTER_CLASS, "")
if (filterClassName.nonEmpty) {
  val filterClazz =
    Class.forName(filterClassName).asInstanceOf[Class[_ <: PathFilter]]
```
However `Class.forName` is probably not using a classloader that can see
user code.
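One safer alternative (a sketch, not necessarily what the final patch should do; Spark has an internal `Utils.getContextOrSparkClassLoader` helper for exactly this purpose) is to resolve the class through the thread context classloader:
```scala
// Prefer the context classloader, which can see user jars added via
// --jars / spark.jars; fall back to the loader of the enclosing class.
val loader = Option(Thread.currentThread().getContextClassLoader)
  .getOrElse(getClass.getClassLoader)
val filterClazz = Class.forName(filterClassName, true, loader)
  .asInstanceOf[Class[_ <: PathFilter]]
```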