Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2017-09-14 Thread Eyal Allweil via Review Board


> On Nov. 14, 2014, 2:40 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/macros/nlp/tf_idf.pig
> > Lines 72 (patched)
> > 
> >
> > Shouldn't this be SUM?

As far as I can tell, it's OK that this is COUNT if we're counting documents 
(as I understand TF-IDF, the IDF part divides by the number of documents that 
contain the term, not by the actual number of occurrences).
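
To make the COUNT-versus-SUM distinction concrete, here is a minimal Python 
sketch. It is not the macro's Pig code: the toy corpus and the function names 
are made up for illustration, and log(N / df) is only one common form of IDF. 
Counting each document at most once per term gives the document frequency, 
while summing occurrences gives the total number of times the term appears.

```python
import math
from collections import Counter

# Hypothetical toy corpus: doc id -> tokens (not taken from the patch).
docs = {
    "d1": ["pig", "pig", "macro"],
    "d2": ["pig", "udf"],
    "d3": ["macro", "udf", "udf"],
}

def document_frequency(docs):
    """COUNT-style: number of documents that contain each term."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # each document contributes at most 1 per term
    return df

def total_occurrences(docs):
    """SUM-style: total number of times each term appears in the corpus."""
    total = Counter()
    for tokens in docs.values():
        total.update(tokens)
    return total

df = document_frequency(docs)
occ = total_occurrences(docs)
n_docs = len(docs)

# One common IDF form (an assumption here): log(N / df).
idf = {term: math.log(n_docs / count) for term, count in df.items()}
```

In this corpus "pig" appears three times but in only two documents, so the 
two aggregations give different denominators.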


- Eyal


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61348
---


On Nov. 10, 2014, 8:33 p.m., Russell Jurney wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27820/
> ---
> 
> (Updated Nov. 10, 2014, 8:33 p.m.)
> 
> 
> Review request for DataFu, pig, Joseph Adler, Jakob Homan, Matthew Hayes, and 
> Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> ---
> 
> DATAFU-61 - Add TF-IDF Macro to DataFu
> 
> 
> Diffs
> -
> 
>   datafu-pig/src/main/macros/nlp/tf_idf.pig PRE-CREATION 
>   datafu-pig/src/test/macros/nlp/test_tf_idf.pig PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/27820/diff/1/
> 
> 
> Testing
> ---
> 
> Works for me, but testing not automated. See 
> https://issues.apache.org/jira/browse/DATAFU-61
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>



Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-18 Thread Matthew Hayes


 On Nov. 14, 2014, 12:46 a.m., Matthew Hayes wrote:
  datafu-pig/src/main/macros/nlp/tf_idf.pig, line 27
  https://reviews.apache.org/r/27820/diff/1/?file=756916#file756916line27
 
  We should give some thought towards how to best namespace this and 
  other macros.
  
  Although a bit wordy, this would avoid conflicts in the future:
  
  DataFu_TFIDF_OpenNlp_Simple
  
  If we supported a maximum entropy version later we could have:
  
  DataFu_TFIDF_OpenNlp_MaxEnt
  
  I am open to ideas :)
  
  We may also want to have a version of the macro in the future where the 
  tokens can be fed in, without tokenization of raw text.
 
 Russell Jurney wrote:
 This sounds pretty reasonable. Actually, why don't I make the sample UDF 
 configurable? An option of the Macro.

It might be hard to parameterize all the NLP options.  For example, the 
TokenizeME takes a parameter for the tokenization data.  Unfortunately macros 
cannot include other macros so we'd have to either copy and paste a lot or come 
up with some build mechanism to templatize these.


- Matthew


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61350
---






Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-18 Thread Russell Jurney


 On Nov. 14, 2014, 12:46 a.m., Matthew Hayes wrote:
  datafu-pig/src/main/macros/nlp/tf_idf.pig, line 27
  https://reviews.apache.org/r/27820/diff/1/?file=756916#file756916line27
 
  We should give some thought towards how to best namespace this and 
  other macros.
  
  Although a bit wordy, this would avoid conflicts in the future:
  
  DataFu_TFIDF_OpenNlp_Simple
  
  If we supported a maximum entropy version later we could have:
  
  DataFu_TFIDF_OpenNlp_MaxEnt
  
  I am open to ideas :)
  
  We may also want to have a version of the macro in the future where the 
  tokens can be fed in, without tokenization of raw text.
 
 Russell Jurney wrote:
 This sounds pretty reasonable. Actually, why don't I make the sample UDF 
 configurable? An option of the Macro.
 
 Matthew Hayes wrote:
 It might be hard to parameterize all the NLP options.  For example, the 
 TokenizeME takes a parameter for the tokenization data.  Unfortunately macros 
 cannot include other macros so we'd have to either copy and paste a lot or 
 come up with some build mechanism to templatize these.

You're right about the different interfaces to the tokenizers. Who wrote those 
things, anyway? :) I'll use the name you suggested. 

However, macros can call other macros. Is that what you meant by include? I did 
a post on this here: 
http://datasyndrome.com/post/17186084960/the-power-of-pig-macros


- Russell


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61350
---






Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-18 Thread Matthew Hayes


 On Nov. 14, 2014, 12:46 a.m., Matthew Hayes wrote:
  datafu-pig/src/main/macros/nlp/tf_idf.pig, line 27
  https://reviews.apache.org/r/27820/diff/1/?file=756916#file756916line27
 
  We should give some thought towards how to best namespace this and 
  other macros.
  
  Although a bit wordy, this would avoid conflicts in the future:
  
  DataFu_TFIDF_OpenNlp_Simple
  
  If we supported a maximum entropy version later we could have:
  
  DataFu_TFIDF_OpenNlp_MaxEnt
  
  I am open to ideas :)
  
  We may also want to have a version of the macro in the future where the 
  tokens can be fed in, without tokenization of raw text.
 
 Russell Jurney wrote:
 This sounds pretty reasonable. Actually, why don't I make the sample UDF 
 configurable? An option of the Macro.
 
 Matthew Hayes wrote:
 It might be hard to parameterize all the NLP options.  For example, the 
 TokenizeME takes a parameter for the tokenization data.  Unfortunately macros 
 cannot include other macros so we'd have to either copy and paste a lot or 
 come up with some build mechanism to templatize these.
 
 Russell Jurney wrote:
 You're right about the different interfaces to the tokenizers. Who wrote 
 those things, anyway? :) I'll use the name you suggested. 
 
 However, macros can call other macros. Is that what you meant by include? 
 I did a post on this here: 
 http://datasyndrome.com/post/17186084960/the-power-of-pig-macros

Oh, I didn't know that macros could call other macros.  That's what I meant 
by "include".  I thought they couldn't call other macros, but I guess I am 
just not up to date on the documentation :)  If that works, then you can put 
the majority of the code in a common macro that the others call.


- Matthew


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61350
---






Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-17 Thread Russell Jurney


 On Nov. 14, 2014, 12:46 a.m., Matthew Hayes wrote:
  datafu-pig/src/main/macros/nlp/tf_idf.pig, line 27
  https://reviews.apache.org/r/27820/diff/1/?file=756916#file756916line27
 
  We should give some thought towards how to best namespace this and 
  other macros.
  
  Although a bit wordy, this would avoid conflicts in the future:
  
  DataFu_TFIDF_OpenNlp_Simple
  
  If we supported a maximum entropy version later we could have:
  
  DataFu_TFIDF_OpenNlp_MaxEnt
  
  I am open to ideas :)
  
  We may also want to have a version of the macro in the future where the 
  tokens can be fed in, without tokenization of raw text.

This sounds pretty reasonable. Actually, why don't I make the sample UDF 
configurable? An option of the Macro.


- Russell


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61350
---






Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-13 Thread Matthew Hayes

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61348
---



datafu-pig/src/main/macros/nlp/tf_idf.pig
https://reviews.apache.org/r/27820/#comment102923

Shouldn't this be SUM?


- Matthew Hayes






Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-13 Thread Matthew Hayes

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61350
---



datafu-pig/src/main/macros/nlp/tf_idf.pig
https://reviews.apache.org/r/27820/#comment102924

Overall I like the simplicity of this macro.  It seems really easy to use.  
I would add a note on how tokenization is done (i.e. using TokenizeSimple, 
which uses character classes) and that it uses augmented term frequency.
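
For readers unfamiliar with the term: augmented term frequency is commonly 
defined as 0.5 + 0.5 * f / max_f, where f is the raw count of a term in a 
document and max_f is the count of the most frequent term in that document. 
The Python sketch below assumes that standard definition. The patch itself is 
not reproduced in this thread, so whether the macro uses exactly this formula 
is an assumption, and the function name is hypothetical.

```python
from collections import Counter

def augmented_tf(tokens):
    """Augmented term frequency for one document's tokens.

    Assumes the common definition 0.5 + 0.5 * f / max_f; the macro under
    review may differ in its details.
    """
    counts = Counter(tokens)
    if not counts:
        return {}  # empty document: nothing to score
    max_count = max(counts.values())
    return {term: 0.5 + 0.5 * c / max_count for term, c in counts.items()}

tf = augmented_tf(["pig", "pig", "macro"])
```

Scaling by the most frequent term keeps every score between 0.5 and 1.0, 
which reduces the bias toward long documents.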



datafu-pig/src/main/macros/nlp/tf_idf.pig
https://reviews.apache.org/r/27820/#comment102925

We should give some thought towards how to best namespace this and other 
macros.

Although a bit wordy, this would avoid conflicts in the future:

DataFu_TFIDF_OpenNlp_Simple

If we supported a maximum entropy version later we could have:

DataFu_TFIDF_OpenNlp_MaxEnt

I am open to ideas :)

We may also want to have a version of the macro in the future where the 
tokens can be fed in, without tokenization of raw text.


- Matthew Hayes






Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-10 Thread Russell Jurney

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/
---

Review request for DataFu, pig, Joseph Adler, Jakob Homan, Matthew Hayes, and 
Sam Shah.


Repository: datafu


Description
---

DATAFU-61 - Add TF-IDF Macro to DataFu


Diffs
-

  datafu-pig/src/main/macros/nlp/tf_idf.pig PRE-CREATION 
  datafu-pig/src/test/macros/nlp/test_tf_idf.pig PRE-CREATION 

Diff: https://reviews.apache.org/r/27820/diff/


Testing
---

Works for me, but testing not automated.


Thanks,

Russell Jurney



Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2014-11-10 Thread Russell Jurney

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/
---

(Updated Nov. 10, 2014, 6:33 p.m.)


Review request for DataFu, pig, Joseph Adler, Jakob Homan, Matthew Hayes, and 
Sam Shah.


Repository: datafu


Description
---

DATAFU-61 - Add TF-IDF Macro to DataFu


Diffs
-

  datafu-pig/src/main/macros/nlp/tf_idf.pig PRE-CREATION 
  datafu-pig/src/test/macros/nlp/test_tf_idf.pig PRE-CREATION 

Diff: https://reviews.apache.org/r/27820/diff/


Testing (updated)
---

Works for me, but testing not automated. See 
https://issues.apache.org/jira/browse/DATAFU-61


Thanks,

Russell Jurney