Closed itemsets: 1 5 2 3 (2) 1 3 (3) 5 2 3 (3) 5 2 (4) 3 (4)
That's right.. they are prefixed and repeated for easy indexing.. - Neal On Wed, Apr 14, 2010 at 9:57 PM, Robin Anil <robin.a...@gmail.com> wrote: > I have to clarify this as many have asked this before. Mahout's > Implementation is Top K FPGrowth that finds closed patterns > > if {1} - 3 is a pattern {1, 2}- 3 is another top pattern of item 1 > > {1, 2} - 3 is a closed pattern of {1} -3 as there is no information that the > former gives that that the latter gives extra, this gives significant boost > in speed for the Mahout FP-Growth algorithm. If the closed pattern is not > there, then there should be a problem with implementation otherwise its an > algorithm choice. I am forgetting the paper form which I implemented this. I > will post that when I find it. > > Robin > > > On Thu, Apr 15, 2010 at 4:03 AM, Robert Neumayer <neuma...@idi.ntnu.no>wrote: > >> Hi. >> >> I was trying to test the fpgrowth algorithm on a toy example and I kind of >> fail >> to get the correct results ... so it'd be nice if someone could take a look >> at >> my (admittedly very ad-hoc) code and tell me whether there's something >> wrong with it. I'm running the mahout svn version. >> >> I tried to run the following example (input): >> >> 1 3 4 >> 2 3 5 >> 1 2 3 5 >> 2 5 >> 1 2 3 5 >> >> With the following code: >> public static void main(String[] args) { >> int minSupport = 2; >> int maxHeapSize = 100; >> >> String input = "inputfile"; >> String output = "output"; >> >> FPGrowth<String> fp = new FPGrowth<String>(); >> FileSystem fs = new RawLocalFileSystem(); >> Configuration conf = new Configuration(); >> >> String pattern = "[\\ ]"; >> try { >> fs = FileSystem.get(URI.create(output), conf); >> >> SequenceFile.Writer writer = null; >> writer = new SequenceFile.Writer(fs, conf, new Path(output), >> Text.class, TopKStringPatterns.class); >> >> Charset encoding = Charset.forName("UTF-8"); >> >> List<Pair<String, Long>> generateFList = null; >> generateFList = fp.generateFList(new StringRecordIterator(new >> FileLineIterable(new File(input), encoding, false), >> pattern), minSupport); >> >> StringRecordIterator transactions = null; >> >> transactions = new StringRecordIterator(new FileLineIterable(new >> File(input), encoding, false), pattern); >> >> >> List<Text> keyList = new LinkedList<Text>(); >> List<TopKStringPatterns> valueList = new >> LinkedList<TopKStringPatterns>(); >> >> StringOutputCollector<Text, TopKStringPatterns> collector = new >> StringOutputCollector<Text, TopKStringPatterns>( >> keyList, valueList); >> StringOutputConverter soc = new >> StringOutputConverter(collector); >> >> StatusUpdater updater = new StatusUpdater() { >> �...@override >> public void update(String status) { >> System.out.println(status); >> } >> }; >> >> fp.generateTopKFrequentPatterns(transactions, generateFList, >> minSupport, maxHeapSize, null, soc, updater); >> writer.close(); >> fs.close(); >> >> System.out.println("list.size: " + valueList.size()); >> HashSet<List<String>> unique = new HashSet<List<String>>(); >> for (int i = 0; i < valueList.size(); i++) { >> >> System.out.println(keyList.get(i) + " / " + >> valueList.get(i)); >> Iterator<Pair<List<String>, Long>> iterator = >> valueList.get(i).iterator(); >> while (iterator.hasNext()) { >> >> unique.add(iterator.next().getFirst()); >> } >> } >> // print a bit more clearly >> Iterator<List<String>> iterator = unique.iterator(); >> while (iterator.hasNext()) { >> System.out.println(iterator.next()); >> } >> >> } catch (IOException e) { >> e.printStackTrace(); >> } >> } >> >> I wrote another collector class (I know, that was probably not necessary >> but I >> wanted to use the API a bit): >> >> public class StringOutputCollector<K extends Writable, V extends Writable> >> implements OutputCollector<K, V> { >> >> private final List<K> keyList; >> >> >> private final List<V> valueList; >> >> public StringOutputCollector(List<K> keyList, List<V> valueList) { >> this.keyList = keyList; >> this.valueList = valueList; >> } >> >> �...@override >> public final void collect(K key, V value) throws IOException { >> keyList.add(key); >> valueList.add(value); >> } >> >> } >> >> The output I get is the following: >> 1 / ([1, 3],3), ([1, 2, 3, 5],2) >> 5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) >> 3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2) >> 2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) >> >> But I think there's more frequent itemsets to find in these transactions >> (with >> min support 2): >> >> {1} 3 >> {2} 4 >> {3} 4 >> {5} 4 >> {1, 2} 2 >> {1, 3} 3 >> {1, 5} 2 >> {2, 3} 3 >> {2, 5} 4 >> {3, 5} 3 >> {1, 2, 3} 2 >> {1, 2, 5} 3 >> {1, 3, 5} 2 >> {2, 3, 5} 3 >> {1, 2, 3, 5} 2 >> >> Also, some of them are duplicates with the mahout code. I'm not sure >> whether >> it's supposed to be like that. >> >> I also tried another textbook example and that didn't produce the same >> results >> either. >> >> So am I doing something wrong with the topk selection (that's what I maybe >> could think of as a stupid mistake of mine)? >> >> Any hints would be appreciated. >> >> R >> >