Hi all,
I'm working on a small project for university and I have some questions
about how to implement it. Maybe you could give me some hints.
I have a directory that contains around 1 million HTML files. Basically,
I just want to read each file entirely into a String and parse it with
JSoup in a Mapper. Is there an InputFormat that can be used for this
use case, or do I have to implement my own FileInputFormat? :/
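To make the use case concrete, here is a minimal sketch in plain Java (no Hadoop involved) of the per-file work I have in mind. The class and method names are made up for illustration, it assumes the files are UTF-8 and end in .html, and the JSoup step is only indicated by a comment. In the real job the same read-everything-into-a-String step would have to happen inside whatever InputFormat/RecordReader feeds the Mapper.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, only meant to illustrate the use case.
class WholeFileSketch {

    // Reads one file completely into a String (assumption: UTF-8 content).
    static String readWholeFile(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Iterates over the directory lazily instead of building one big listing,
    // and returns the number of HTML files that were read.
    static long processDirectory(Path dir) throws IOException {
        long count = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.html")) {
            for (Path file : stream) {
                String html = readWholeFile(file);
                // This is where the parsing would go, e.g. Jsoup.parse(html).
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("Processed " + processDirectory(dir) + " HTML files");
    }
}
```

This only shows what should happen for each file; it says nothing about how the files get distributed to the Mappers, which is the part I am unsure about.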
In general: do you think creating InputSplits for the directory will work
properly with 1 million FileStatus objects?
Regards,
Timo
- Process directories containing large number of files Timo Walther