Hi all! The dataset for my project turned out to be huge, and my teacher told me I have to use the Hadoop framework. I am struggling to understand how to do this, but honestly, I can't get past the starting point :( I am not that good a programmer, and I cannot find any classmates who know Hadoop to help me out... I apologize in advance for writing so much, but I really don't know what else to do.
I already have this implemented with some Java concurrency features, but it is too slow for a data set of this size. The algorithm works as follows:

- It takes one folder and finds the .txt files with a specific name in all of its subfolders.
- It queries a Lucene index and populates a list of the most frequent terms.
- It parses the .txt files line by line and checks whether each line's third word matches any term in the list.
- If the third word of a line matches a term in the list, the entire line is stored in a buffer, and afterwards the buffers are written to output .txt files.

So the final result is a set of .txt files with the same structure as the original ones, except that they are smaller, since they contain only the matching lines.

I am attaching two files:

1) TextFileAnalyzer.java is a Java Callable that takes a .txt file and the term list and does the parsing and comparison.
2) MainAnalyzer.java goes through the main folder, collects the .txt files, and hands them to TextFileAnalyzer callables together with the list it gets from the Lucene index.

I am sorry for asking for so much help, but I really have nobody to ask. I tried to grasp how to do this, but with my skills and the time I have, it is out of my reach. Also, I read that it is not possible to query a Lucene index on Hadoop. Is that true?

I would very much appreciate any help; it is badly needed. Thank you in advance!

Aida

http://lucene.472066.n3.nabble.com/file/n890360/TextFileAnalyzer.java TextFileAnalyzer.java
http://lucene.472066.n3.nabble.com/file/n890360/MainAnalyzer.java MainAnalyzer.java
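To make the matching step concrete, here is a minimal plain-Java sketch of the per-line check described above. It is not the attached TextFileAnalyzer code: the class name `ThirdWordFilter` and its methods are hypothetical, and it assumes that words are separated by whitespace, that the comparison is exact and case-sensitive, and that the term list has already been fetched from the Lucene index and is simply passed in.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the attached TextFileAnalyzer: it only
// illustrates the "third word must be in the term list" rule.
final class ThirdWordFilter {
    // Terms are kept in a hash set so each lookup is constant time.
    private final Set<String> terms = new HashSet<>();

    // The frequent terms are assumed to be fetched from the Lucene
    // index once, up front, and handed in here.
    ThirdWordFilter(Iterable<String> frequentTerms) {
        for (String term : frequentTerms) {
            terms.add(term);
        }
    }

    // Returns the third whitespace-separated word, or null if the
    // line has fewer than three words (assumed line format).
    static String thirdWord(String line) {
        String[] words = line.trim().split("\\s+");
        return words.length >= 3 ? words[2] : null;
    }

    // True if the line's third word is one of the frequent terms
    // (exact, case-sensitive comparison is an assumption).
    boolean matches(String line) {
        String word = thirdWord(line);
        return word != null && terms.contains(word);
    }

    // Keeps only the matching lines, in their original order.
    List<String> filter(Iterable<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (matches(line)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

Since each line is checked independently of every other line, one plausible way to move this to Hadoop would be to call `matches` from a mapper's `map` method and write out only the lines that pass, with no reduce step. How the term list reaches each mapper is left open in this sketch.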
