Re: FPGrowth maybe not producing the right rules

Robin Anil Fri, 16 Apr 2010 01:47:42 -0700

The definitions are same. But I used some ideas from there (all except the
length check)
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.9324
 <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.9324>


and added FP-Bonsai to do faster mining.

<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.9324>Robin

On Thu, Apr 15, 2010 at 12:19 PM, Robert Neumayer <neuma...@idi.ntnu.no>wrote:

>  I have to clarify this as many have asked this before. Mahout's
>> Implementation is  Top K FPGrowth that finds closed patterns
>>
>
> Ok, I didn't know that, thanks for the clarification.
>
> There's a paper explaining exactly this (is it this one you used for
> implementing maybe?):
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.1102
>
> R
>
>
> On Thu, 15 Apr 2010, Robin Anil wrote:
>
>
>> if {1} - 3 is a pattern {1, 2}- 3 is another top pattern  of item 1
>>
>> {1, 2} - 3 is a closed pattern of {1} -3 as there is no information that
>> the
>> former gives that that the latter gives extra, this gives significant
>> boost
>> in speed for the Mahout FP-Growth algorithm. If the closed pattern is not
>> there, then there should be a problem with implementation otherwise its an
>> algorithm choice. I am forgetting the paper form which I implemented this.
>> I
>> will post that when I find it.
>>
>> Robin
>>
>>
>> On Thu, Apr 15, 2010 at 4:03 AM, Robert Neumayer <neuma...@idi.ntnu.no
>> >wrote:
>>
>>  Hi.
>>>
>>> I was trying to test the fpgrowth algorithm on a toy example and I kind
>>> of
>>> fail
>>> to get the correct results ... so it'd be nice if someone could take a
>>> look
>>> at
>>> my (admittedly very ad-hoc) code and tell me whether there's something
>>> wrong with it. I'm running the mahout svn version.
>>>
>>> I tried to run the following example (input):
>>>
>>> 1 3 4
>>> 2 3 5
>>> 1 2 3 5
>>> 2 5
>>> 1 2 3 5
>>>
>>> With the following code:
>>>  public static void main(String[] args) {
>>>       int minSupport = 2;
>>>       int maxHeapSize = 100;
>>>
>>>       String input = "inputfile";
>>>       String output = "output";
>>>
>>>       FPGrowth<String> fp = new FPGrowth<String>();
>>>       FileSystem fs = new RawLocalFileSystem();
>>>       Configuration conf = new Configuration();
>>>
>>>       String pattern = "[\\ ]";
>>>       try {
>>>           fs = FileSystem.get(URI.create(output), conf);
>>>
>>>           SequenceFile.Writer writer = null;
>>>           writer = new SequenceFile.Writer(fs, conf, new Path(output),
>>> Text.class, TopKStringPatterns.class);
>>>
>>>           Charset encoding = Charset.forName("UTF-8");
>>>
>>>           List<Pair<String, Long>> generateFList = null;
>>>           generateFList = fp.generateFList(new StringRecordIterator(new
>>> FileLineIterable(new File(input), encoding, false),
>>>                   pattern), minSupport);
>>>
>>>           StringRecordIterator transactions = null;
>>>
>>>           transactions = new StringRecordIterator(new
>>> FileLineIterable(new
>>> File(input), encoding, false), pattern);
>>>
>>>
>>>           List<Text> keyList = new LinkedList<Text>();
>>>           List<TopKStringPatterns> valueList = new
>>> LinkedList<TopKStringPatterns>();
>>>
>>>           StringOutputCollector<Text, TopKStringPatterns> collector = new
>>> StringOutputCollector<Text, TopKStringPatterns>(
>>>                   keyList, valueList);
>>>           StringOutputConverter soc = new
>>> StringOutputConverter(collector);
>>>
>>>           StatusUpdater updater = new StatusUpdater() {
>>>               @Override
>>>               public void update(String status) {
>>>                   System.out.println(status);
>>>               }
>>>           };
>>>
>>>           fp.generateTopKFrequentPatterns(transactions, generateFList,
>>> minSupport, maxHeapSize, null, soc, updater);
>>>           writer.close();
>>>           fs.close();
>>>
>>>           System.out.println("list.size: " + valueList.size());
>>>           HashSet<List<String>> unique = new HashSet<List<String>>();
>>>           for (int i = 0; i < valueList.size(); i++) {
>>>
>>>               System.out.println(keyList.get(i) + " / " +
>>> valueList.get(i));
>>>               Iterator<Pair<List<String>, Long>> iterator =
>>> valueList.get(i).iterator();
>>>               while (iterator.hasNext()) {
>>>
>>>                   unique.add(iterator.next().getFirst());
>>>               }
>>>           }
>>>           // print a bit more clearly
>>>           Iterator<List<String>> iterator = unique.iterator();
>>>           while (iterator.hasNext()) {
>>>               System.out.println(iterator.next());
>>>           }
>>>
>>>       } catch (IOException e) {
>>>           e.printStackTrace();
>>>       }
>>>   }
>>>
>>> I wrote another collector class (I know, that was probably not necessary
>>> but I
>>> wanted to use the API a bit):
>>>
>>> public class StringOutputCollector<K extends Writable, V extends
>>> Writable>
>>> implements OutputCollector<K, V> {
>>>
>>>   private final List<K> keyList;
>>>
>>>
>>>   private final List<V> valueList;
>>>
>>>   public StringOutputCollector(List<K> keyList, List<V> valueList) {
>>>       this.keyList = keyList;
>>>       this.valueList = valueList;
>>>   }
>>>
>>>   @Override
>>>   public final void collect(K key, V value) throws IOException {
>>>       keyList.add(key);
>>>       valueList.add(value);
>>>   }
>>>
>>> }
>>>
>>> The output I get is the following:
>>> 1 / ([1, 3],3), ([1, 2, 3, 5],2)
>>> 5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
>>> 3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2)
>>> 2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
>>>
>>> But I think there's more frequent itemsets to find in these transactions
>>> (with
>>> min support 2):
>>>
>>> {1}     3
>>> {2}     4
>>> {3}     4
>>> {5}     4
>>> {1, 2}  2
>>> {1, 3}  3
>>> {1, 5}  2
>>> {2, 3}  3
>>> {2, 5}  4
>>> {3, 5}  3
>>> {1, 2, 3}       2
>>> {1, 2, 5}       3
>>> {1, 3, 5}       2
>>> {2, 3, 5}       3
>>> {1, 2, 3, 5}    2
>>>
>>> Also, some of them are duplicates with the mahout code. I'm not sure
>>> whether
>>> it's supposed to be like that.
>>>
>>> I also tried another textbook example and that didn't produce the same
>>> results
>>> either.
>>>
>>> So am I doing something wrong with the topk selection (that's what I
>>> maybe
>>> could think of as a stupid mistake of mine)?
>>>
>>> Any hints would be appreciated.
>>>
>>> R
>>>
>>>
>>
>
> --
> Robert Neumayer
> Norwegian University of Science and Technology (NTNU)
> Department of Computer and Information Science
> Sem Sælands vei 7-9
> NO-7491 Trondheim
> http://www.idi.ntnu.no/~neumayer
>

Re: FPGrowth maybe not producing the right rules

Reply via email to