The definitions are same. But I used some ideas from there (all except the length check) http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.9324 <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.9324>
and added FP-Bonsai to do faster mining. <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.9324>Robin On Thu, Apr 15, 2010 at 12:19 PM, Robert Neumayer <neuma...@idi.ntnu.no>wrote: > I have to clarify this as many have asked this before. Mahout's >> Implementation is Top K FPGrowth that finds closed patterns >> > > Ok, I didn't know that, thanks for the clarification. > > There's a paper explaining exactly this (is it this one you used for > implementing maybe?): > > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.1102 > > R > > > On Thu, 15 Apr 2010, Robin Anil wrote: > > >> if {1} - 3 is a pattern {1, 2}- 3 is another top pattern of item 1 >> >> {1, 2} - 3 is a closed pattern of {1} -3 as there is no information that >> the >> former gives that that the latter gives extra, this gives significant >> boost >> in speed for the Mahout FP-Growth algorithm. If the closed pattern is not >> there, then there should be a problem with implementation otherwise its an >> algorithm choice. I am forgetting the paper form which I implemented this. >> I >> will post that when I find it. >> >> Robin >> >> >> On Thu, Apr 15, 2010 at 4:03 AM, Robert Neumayer <neuma...@idi.ntnu.no >> >wrote: >> >> Hi. >>> >>> I was trying to test the fpgrowth algorithm on a toy example and I kind >>> of >>> fail >>> to get the correct results ... so it'd be nice if someone could take a >>> look >>> at >>> my (admittedly very ad-hoc) code and tell me whether there's something >>> wrong with it. I'm running the mahout svn version. >>> >>> I tried to run the following example (input): >>> >>> 1 3 4 >>> 2 3 5 >>> 1 2 3 5 >>> 2 5 >>> 1 2 3 5 >>> >>> With the following code: >>> public static void main(String[] args) { >>> int minSupport = 2; >>> int maxHeapSize = 100; >>> >>> String input = "inputfile"; >>> String output = "output"; >>> >>> FPGrowth<String> fp = new FPGrowth<String>(); >>> FileSystem fs = new RawLocalFileSystem(); >>> Configuration conf = new Configuration(); >>> >>> String pattern = "[\\ ]"; >>> try { >>> fs = FileSystem.get(URI.create(output), conf); >>> >>> SequenceFile.Writer writer = null; >>> writer = new SequenceFile.Writer(fs, conf, new Path(output), >>> Text.class, TopKStringPatterns.class); >>> >>> Charset encoding = Charset.forName("UTF-8"); >>> >>> List<Pair<String, Long>> generateFList = null; >>> generateFList = fp.generateFList(new StringRecordIterator(new >>> FileLineIterable(new File(input), encoding, false), >>> pattern), minSupport); >>> >>> StringRecordIterator transactions = null; >>> >>> transactions = new StringRecordIterator(new >>> FileLineIterable(new >>> File(input), encoding, false), pattern); >>> >>> >>> List<Text> keyList = new LinkedList<Text>(); >>> List<TopKStringPatterns> valueList = new >>> LinkedList<TopKStringPatterns>(); >>> >>> StringOutputCollector<Text, TopKStringPatterns> collector = new >>> StringOutputCollector<Text, TopKStringPatterns>( >>> keyList, valueList); >>> StringOutputConverter soc = new >>> StringOutputConverter(collector); >>> >>> StatusUpdater updater = new StatusUpdater() { >>> @Override >>> public void update(String status) { >>> System.out.println(status); >>> } >>> }; >>> >>> fp.generateTopKFrequentPatterns(transactions, generateFList, >>> minSupport, maxHeapSize, null, soc, updater); >>> writer.close(); >>> fs.close(); >>> >>> System.out.println("list.size: " + valueList.size()); >>> HashSet<List<String>> unique = new HashSet<List<String>>(); >>> for (int i = 0; i < valueList.size(); i++) { >>> >>> System.out.println(keyList.get(i) + " / " + >>> valueList.get(i)); >>> Iterator<Pair<List<String>, Long>> iterator = >>> valueList.get(i).iterator(); >>> while (iterator.hasNext()) { >>> >>> unique.add(iterator.next().getFirst()); >>> } >>> } >>> // print a bit more clearly >>> Iterator<List<String>> iterator = unique.iterator(); >>> while (iterator.hasNext()) { >>> System.out.println(iterator.next()); >>> } >>> >>> } catch (IOException e) { >>> e.printStackTrace(); >>> } >>> } >>> >>> I wrote another collector class (I know, that was probably not necessary >>> but I >>> wanted to use the API a bit): >>> >>> public class StringOutputCollector<K extends Writable, V extends >>> Writable> >>> implements OutputCollector<K, V> { >>> >>> private final List<K> keyList; >>> >>> >>> private final List<V> valueList; >>> >>> public StringOutputCollector(List<K> keyList, List<V> valueList) { >>> this.keyList = keyList; >>> this.valueList = valueList; >>> } >>> >>> @Override >>> public final void collect(K key, V value) throws IOException { >>> keyList.add(key); >>> valueList.add(value); >>> } >>> >>> } >>> >>> The output I get is the following: >>> 1 / ([1, 3],3), ([1, 2, 3, 5],2) >>> 5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) >>> 3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2) >>> 2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) >>> >>> But I think there's more frequent itemsets to find in these transactions >>> (with >>> min support 2): >>> >>> {1} 3 >>> {2} 4 >>> {3} 4 >>> {5} 4 >>> {1, 2} 2 >>> {1, 3} 3 >>> {1, 5} 2 >>> {2, 3} 3 >>> {2, 5} 4 >>> {3, 5} 3 >>> {1, 2, 3} 2 >>> {1, 2, 5} 3 >>> {1, 3, 5} 2 >>> {2, 3, 5} 3 >>> {1, 2, 3, 5} 2 >>> >>> Also, some of them are duplicates with the mahout code. I'm not sure >>> whether >>> it's supposed to be like that. >>> >>> I also tried another textbook example and that didn't produce the same >>> results >>> either. >>> >>> So am I doing something wrong with the topk selection (that's what I >>> maybe >>> could think of as a stupid mistake of mine)? >>> >>> Any hints would be appreciated. >>> >>> R >>> >>> >> > > -- > Robert Neumayer > Norwegian University of Science and Technology (NTNU) > Department of Computer and Information Science > Sem Sælands vei 7-9 > NO-7491 Trondheim > http://www.idi.ntnu.no/~neumayer >