[jira] [Commented] (DATAFU-137) KEYS sigs and hashes must be linked from the download page

2018-02-15 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365386#comment-16365386
 ] 

Eyal Allweil commented on DATAFU-137:
-

[~sheric], can you take a look at this?

> KEYS sigs and hashes must be linked from the download page
> --
>
> Key: DATAFU-137
> URL: https://issues.apache.org/jira/browse/DATAFU-137
> Project: DataFu
>  Issue Type: Bug
>Reporter: Sebb
>Priority: Major
>
> The download page refers to verifying downloads, but leaves it up to the user 
> to find the sigs and hashes.
> The sigs and hashes should be listed beside the files to which they apply.
> See for example:
> https://httpd.apache.org/download.cgi#apache24
> which uses a simple table.
> URLs should use https, e.g.
> https://www.apache.org/dist/incubator/datafu/apache-datafu-incubating-1.3.3/apache-datafu-incubating-sources-1.3.3.tgz.asc
> The KEYS file URL should use https, and must use the ASF main host, i.e.
> https://www.apache.org/dist/incubator/datafu/KEYS



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DATAFU-132) Make DataFu compile with Java 8

2018-01-08 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-132:
---

 Summary: Make DataFu compile with Java 8
 Key: DATAFU-132
 URL: https://issues.apache.org/jira/browse/DATAFU-132
 Project: DataFu
  Issue Type: Improvement
Affects Versions: 1.3.3
Reporter: Eyal Allweil
 Attachments: DATAFU-132.patch

Currently DataFu only compiles with Java 7. It would be great if Java 8 could 
be used. The attached patch makes this possible by making the Java version 
check more permissive (anything at or above Java 7).

I've done a build, run the tests, deployed the resulting jar, and tested a macro 
and a UDF with it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-99) Can't build on Windows

2018-01-08 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-99:
---
Attachment: DATAFU-99.patch

I've worked out an ugly fix that sidesteps the problem described in the bug.

Basically, the file-locking-with-hashes-when-copying-within-the-same-folder 
problem exists in the [copy 
task|https://docs.gradle.org/3.5.1/userguide/working_with_files.html#sec:copying_files]
 because it attempts to be incremental.

What I did was copy the gradle.properties file to a temporary folder, and then 
use the [copy 
method|https://docs.gradle.org/3.5.1/dsl/org.gradle.api.Project.html#org.gradle.api.Project:copy(org.gradle.api.Action)]
 to copy the changed file back into the original directory.

Like I wrote at the beginning - ugly, but fixes the Windows build and it 
doesn't take long to copy a single small text file one extra time.

> Can't build on Windows
> --
>
> Key: DATAFU-99
> URL: https://issues.apache.org/jira/browse/DATAFU-99
> Project: DataFu
>  Issue Type: Bug
>Reporter: Matthew Hayes
> Attachments: DATAFU-99.patch
>
>
> [~ihadanny] reported that there is an issue building on Windows due to some 
> Gradle bug: 
> https://discuss.gradle.org/t/error-with-a-copy-task-on-windows/1803/3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-99) Can't build on Windows

2018-01-07 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315193#comment-16315193
 ] 

Eyal Allweil commented on DATAFU-99:


Until this is solved completely, please note that building the projects 
separately *DOES work*.

Instead of _./gradlew clean assemble_

You can use _./gradlew clean :datafu-pig:assemble_ or _./gradlew clean 
:datafu-hourglass:assemble_





> Can't build on Windows
> --
>
> Key: DATAFU-99
> URL: https://issues.apache.org/jira/browse/DATAFU-99
> Project: DataFu
>  Issue Type: Bug
>Reporter: Matthew Hayes
>
> [~ihadanny] reported that there is an issue building on Windows due to some 
> Gradle bug: 
> https://discuss.gradle.org/t/error-with-a-copy-task-on-windows/1803/3



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2017-12-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil closed DATAFU-116.
---
Resolution: Won't Fix

Since it seems like Pig doesn't use the Accumulator interface when there are 
multiple bags in the input, this improvement isn't relevant for these UDFs.

> Make SetIntersect and SetDifference implement Accumulator
> -
>
> Key: DATAFU-116
> URL: https://issues.apache.org/jira/browse/DATAFU-116
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is 
> always smaller than the inputs. Therefore an accumulator implementation 
> should be possible and it will improve memory usage (somewhat) and allow Pig 
> to optimize loops with these operations better.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-60) Support NDCG calculation within a UDF

2017-12-09 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284984#comment-16284984
 ] 

Eyal Allweil commented on DATAFU-60:


Hi [~jhartman], I know it's been years, but do you think you'll get back to 
this? If not, someone else will finish it off, since there isn't much left to do.

> Support NDCG calculation within a UDF
> -
>
> Key: DATAFU-60
> URL: https://issues.apache.org/jira/browse/DATAFU-60
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Joshua Hartman
>  Labels: features
> Attachments: DATAFU-60-v2.patch, DATAFU-60.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> NDCG is a common evaluation metric for the quality of a list of items 
> presented to a user. It is an especially common metric in the search 
> literature.
> This feature request is to implement a UDF to calculate NDCG. 
> NDCG can be calculated using any function to represent the value of a 
> position. Several useful functions should be available as part of the DataFu 
> library. First is the standard 1/logarithmic discounting factor. Another 
> option should be the ability to supply a custom positional value for any 
> range of positions in the case that a positional "value" is already well 
> understood. However, the actual discounting function used should be easily 
> pluggable in the event something custom is needed.
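
For reference, the standard formulation with the logarithmic discounting factor 
mentioned above (rel_i is the relevance of the item at position i, and IDCG@k is 
the DCG@k of the ideal ordering):

{noformat}
DCG@k  = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}
NDCG@k = \frac{DCG@k}{IDCG@k}
{noformat}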



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-47) UDF for Murmur3 (and other) Hash functions

2017-12-05 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-47:
---
Attachment: DATAFU-47-new.patch

I looked at the review board for this issue, and fixed the merge conflicts in 
HashTests and addressed the comments that were left. It depends on DATAFU-50, 
which was reopened, but I put a new patch there so that we can proceed with 
both.

Since I didn't create the review, I can't upload a new diff there, but I've 
attached it to the Jira issue, and commented in the review board where 
appropriate.

Tests pass, and I've run the content of "hasherTest" on a cluster using the 
assembled DataFu jar to make sure that the autojarring of the new Guava version 
works properly.

I'll respond to the review board comments later.

> UDF for Murmur3 (and other) Hash functions
> --
>
> Key: DATAFU-47
> URL: https://issues.apache.org/jira/browse/DATAFU-47
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>  Labels: Guava, Hash, UDF
> Attachments: 
> 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch, 
> 0001-UDF-for-Murmur3-and-other-Hash-functions.patch, DATAFU-47-new.patch
>
>
> Datafu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (a fast hash with good 
> statistical properties), SipHash-2-4 (a fast cryptographically secure hash), 
> crc32, adler32, md5 and sha.
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
> [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. 
> Murmur3 is fast, with exceptionally good statistical properties; it's a 
> good choice if all you need is good mixing of the inputs. It is _not_ 
> cryptographically secure; that is, given an  output value from murmur3, there 
> are efficient algorithms to find an input yielding the same output value. 
> Supply the seed as a string that 
> [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
>  can handle.
> * 'sip24', [optional seed]: Returns a [64-bit 
> SipHash-2-4|https://131002.net/siphash/]. SipHash is competitive in 
> performance with Murmur3, and is simpler and faster than the cryptographic 
> algorithms below. When used with a seed, it can be considered 
> cryptographically secure: given the output from a sip24 instance but not the 
> seed used, we cannot efficiently craft a message yielding the same output 
> from that instance.
> * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to 
> Java's Adler32 Checksum
> * 'crc32':   Returns a CRC-32 checksum (32 hash bits) by delegating to Java's 
> CRC32 Checksum.
> * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 
> MessageDigest.
> * 'sha1':Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
> MessageDigest.
> * 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
> MessageDigest.
> * 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
> MessageDigest.
> * 'good-(integer number of bits)': Returns a general-purpose, 
> non-cryptographic-strength, streaming hash function that produces hash codes 
> of length at least minimumBits. Users without specific compatibility 
> requirements and who do not persist the hash codes are encouraged to choose 
> this hash function. (Cryptographers, like dieticians and fashionistas, 
> occasionally realize that We've Been Doing it Wrong This Whole Time. Using 
> 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To 
> (Wear|Eat|Hash With) this Fall.) Values for this hash will change from run to 
> run.
> Examples: 
> {code}
>   define DefaultH    datafu.pig.hash.Hasher();
>   define GoodH   datafu.pig.hash.Hasher('good-32');
>   define BetterH datafu.pig.hash.Hasher('good-127');
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
>   define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
>   define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
>   define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
>   define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
>   define MD5H    datafu.pig.hash.Hasher('md5');
>   define SHA1H   datafu.pig.hash.Hasher('sha1');
>   define SHA256H datafu.pig.hash.Hasher('sha256');
>   define SHA512H datafu.pig.hash.Hasher('sha512');
>   
>   data_in = LOAD 'input' as (val:chararray);
>   
>   data_out = FOREACH data_in GENERATE
> DefaultH(val),   GoodH(val),   BetterH(val),
> MurmurH32(val),  MurmurH32A(val),  MurmurH32B(val),
> MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
> SHA1H(val),   SHA256H(val),

[jira] [Commented] (DATAFU-30) Website crawl errors for class use links

2017-11-30 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273148#comment-16273148
 ] 

Eyal Allweil commented on DATAFU-30:


I think newer Javadoc versions don't have a "Use" button, so this error doesn't 
happen anymore ... I looked at 1.3.0 and 1.3.1. If that's true, I suggest 
closing this issue, since fixing an older version's javadoc seems unimportant 
to me.

> Website crawl errors for class use links
> 
>
> Key: DATAFU-30
> URL: https://issues.apache.org/jira/browse/DATAFU-30
> Project: DataFu
>  Issue Type: Bug
>Reporter: Matthew Hayes
> Attachments: 
> datafu-incubator-apache-org_20140131T042403Z_CrawlErrors.csv
>
>
> Google webmaster tools has reported crawl errors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (DATAFU-118) Automatically run rat task when running assemble

2017-11-30 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil reassigned DATAFU-118:
---

Assignee: Eyal Allweil

> Automatically run rat task when running assemble
> 
>
> Key: DATAFU-118
> URL: https://issues.apache.org/jira/browse/DATAFU-118
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
>Priority: Minor
>
> The rat task checks that our files have the right headers.  We don't 
> automatically run it for assemble so it isn't easy for new contributors to 
> catch issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (DATAFU-125) Upgrade Gradle to v4 or later

2017-11-28 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-125.
-
Resolution: Fixed

Merged - everything looks fine to me. I repeated these tests, and tried out the 
jar with a simple Pig script.

I also added the three deleted files to our _.gitignore_, as [~cur4so] 
suggested (in a different Jira issue).

> Upgrade Gradle to v4 or later
> -
>
> Key: DATAFU-125
> URL: https://issues.apache.org/jira/browse/DATAFU-125
> Project: DataFu
>  Issue Type: Task
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
> Attachments: DATAFU-125-v2.patch, DATAFU-125.patch
>
>
> We should update to the most recent version of Gradle.  We're currently using 
> 2.4.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number

2017-11-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258480#comment-16258480
 ] 

Eyal Allweil commented on DATAFU-63:


I wonder if the gradlew script is there because it can be "just used" on some 
systems. On one computer I couldn't find a gradle installation at all - maybe I 
skipped the bootstrapping step and the script worked for me. I'm no gradle 
expert so I don't know if this is feasible. If not, it sounds like a good idea 
to open a new issue for removing the gradlew, following your suggestion.

I looked at the code you linked, and I'll try again to answer your questions 
(in this issue and in the comments in the code).

The way the Algebraic interface works in Pig is that Pig calls the UDF's 
getInitial method to get the name of the (inner) class with the Initial 
implementation, and instantiates it. You can see this code in POUserFunc in the 
Apache Pig project; it's not part of DataFu. The exec method of that Initial 
instance is then called. Its output is gathered, and the same process happens 
for Intermediate - the output from Initial becomes the input for Intermediate. 
Finally, all these outputs from the various mappers are sent to a reducer, and 
the getFinal method is used to instantiate the Final class.

This means the implementations you've suggested would only work if this UDF 
implemented EvalFunc or AccumulatorEvalFunc directly. But in that case 
we already have a UDF that gives us a comparable implementation in DataFu - 
[ReservoirSample|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/sampling/ReservoirSample.java],
 so I don't think it's what is desired in this particular Jira issue.

This brings us back to what Matthew wrote way back in 2014 - implementing 
algorithm 6 from [this 
paper|http://jmlr.org/proceedings/papers/v28/meng13a.pdf]. I think what that 
algorithm provides is a way to generate a _k_-sized sample without ever holding 
all _k_ items in memory in the mapper (which is the limitation of 
ReservoirSample - it won't work if _k_ is too big). 

Does that make sense?
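
For context, a minimal sketch of how ReservoirSample is used today; the sample 
size of 1000 and the relation names are just illustrative, and the exact 
invocation should be checked against the DataFu sampling docs:

{noformat}
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('1000');

data = LOAD 'input' AS (val:chararray);

-- the reservoir (up to 1000 items) has to fit in memory wherever the UDF runs,
-- which is why a very large k is a problem
sampled = FOREACH (GROUP data ALL) GENERATE FLATTEN(ReservoirSample(data));
{noformat}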

> SimpleRandomSample by a fixed number
> 
>
> Key: DATAFU-63
> URL: https://issues.apache.org/jira/browse/DATAFU-63
> Project: DataFu
>  Issue Type: New Feature
>Reporter: jian wang
>Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does 
> not support sampling a fixed number of items. ReservoirSample may do the 
> work, but since it relies on an in-memory priority queue, memory issues may 
> occur if we are going to sample a huge number of items, e.g. sampling 100M from 
> 100G of data. 
> The suggested approach is to create a new class "SimpleRandomSampleByCount" that 
> uses Manuver's rejection threshold to reject items whose weight exceeds the 
> threshold as we go from mapper to combiner to reducer. The majority of 
> the algorithm will be very similar to SimpleRandomSample, except that we do 
> not use Bernstein's theory to accept items and replace the probability with p = k / n, 
> where k is the number of items to sample and n is the total number of items local 
> to the mapper, combiner and reducer.
> Quote this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a 
> specified number of rows from grouped data? I’m currently doing this, since 
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, 
> secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
> this?
> Thanks,
> "



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-11-15 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253564#comment-16253564
 ] 

Eyal Allweil commented on DATAFU-130:
-

Hi [~varunu28],

Thank you for your interest! Do you have any experience with Pig?

I can't seem to assign this issue to you, but that isn't really necessary to 
work on it. Go right ahead!

For setting up your environment, you should follow [this 
guide|http://datafu.incubator.apache.org/community/contributing.htm].

We haven't published instructions for how to contribute Pig macros. I'll try to 
write a rough draft of a guide and email it here or put it up on our wiki.
In the meantime, you can look at 
[count_macros.pig|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/resources/datafu/count_macros.pig]
 for an example of a macro file (though all you need to do is copy the macro in 
the Jira to the macros directory), and 
[MacroTests.java|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/test/java/datafu/test/pig/macros/MacroTests.java]
 for an example of a test. You can add your test to this file, actually.

For a guide to how to prepare a patch file, you can look 
[here|https://cwiki.apache.org/confluence/display/DATAFU/Contributing+to+Apache+DataFu].


> Add left outer join macro described in the DataFu guide
> ---
>
> Key: DATAFU-130
> URL: https://issues.apache.org/jira/browse/DATAFU-130
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>  Labels: macro, newbie
>
> In our 
> [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
> macro is described for making a three-way left outer join conveniently. We 
> can add this macro to DataFu to make it even easier to use.
> The macro's code is as follows:
> {noformat}
> DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
> returns joined {
>   cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
> $key3;
>   $joined = FOREACH cogrouped GENERATE
> FLATTEN($relation1),
> FLATTEN(EmptyBagToNullFields($relation2)),
> FLATTEN(EmptyBagToNullFields($relation3));
> }
> {noformat}
> (we would obviously want to add a test for this, too)
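
For illustration, a rough sketch of how the macro might be used once it is added; 
the import path is hypothetical (it depends on where the macro file ends up), and 
the relation and field names are placeholders:

{noformat}
IMPORT 'datafu/left_outer_join.pig';
DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();

impressions = LOAD 'impressions' AS (user_id:int, page_id:int);
clicks      = LOAD 'clicks'      AS (user_id:int, page_id:int);
purchases   = LOAD 'purchases'   AS (user_id:int, item_id:int);

joined = left_outer_join(impressions, user_id, clicks, user_id, purchases, user_id);
{noformat}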



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number

2017-11-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249793#comment-16249793
 ] 

Eyal Allweil commented on DATAFU-63:


Hi [~cur4so],

I'll quickly answer your last comment - I'll get to the previous one as soon as 
I can. We do indeed still use Gradle 2.4 in the master branch. We're [about to 
update to Gradle 3.5|https://issues.apache.org/jira/browse/DATAFU-125], but it 
hasn't been merged yet.

However, when I did the gradle bootstrapping, it didn't modify my _gradlew_ 
file - what OS are you on? (BTW - we can't add it to the gitignore because it's 
checked into the repository, and you can't ignore files that are checked in)

> SimpleRandomSample by a fixed number
> 
>
> Key: DATAFU-63
> URL: https://issues.apache.org/jira/browse/DATAFU-63
> Project: DataFu
>  Issue Type: New Feature
>Reporter: jian wang
>Assignee: jian wang
>
> SimpleRandomSample currently supports random sampling by probability; it does 
> not support sampling a fixed number of items. ReservoirSample may do the 
> work, but since it relies on an in-memory priority queue, memory issues may 
> occur if we are going to sample a huge number of items, e.g. sampling 100M from 
> 100G of data. 
> The suggested approach is to create a new class "SimpleRandomSampleByCount" that 
> uses Manuver's rejection threshold to reject items whose weight exceeds the 
> threshold as we go from mapper to combiner to reducer. The majority of 
> the algorithm will be very similar to SimpleRandomSample, except that we do 
> not use Bernstein's theory to accept items and replace the probability with p = k / n, 
> where k is the number of items to sample and n is the total number of items local 
> to the mapper, combiner and reducer.
> Quote this requirement from others:
> "Hi folks,
> Question: does anybody know if there is a quicker way to randomly sample a 
> specified number of rows from grouped data? I’m currently doing this, since 
> it appears that the SAMPLE operator doesn’t work inside FOREACH statements:
> photosGrouped = GROUP photos BY farm;
> agg = FOREACH photosGrouped {
>   rnds = FOREACH photos GENERATE *, RANDOM() as rnd;
>   ordered_rnds = ORDER rnds BY rnd;
>   limitSet = LIMIT ordered_rnds 5000;
>   GENERATE group AS farm,
>FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, 
> secret);
> };
> This approach seems clumsy, and appears to run quite slowly (I’m assuming the 
> ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do 
> this?
> Thanks,
> "



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (DATAFU-48) Upgrade Guava to 20.0

2017-10-29 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-48.

   Resolution: Fixed
 Assignee: Eyal Allweil  (was: Philip (flip) Kromer)
Fix Version/s: 1.3.3

Merged

> Upgrade Guava to 20.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: build, dependency, guava, version
> Fix For: 1.3.3
>
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-48) Upgrade Guava to 20.0

2017-10-29 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-48:
---
Summary: Upgrade Guava to 20.0  (was: Upgrade Guava to 17.0)

> Upgrade Guava to 20.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later

2017-10-26 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220371#comment-16220371
 ] 

Eyal Allweil commented on DATAFU-125:
-

When I build with _./gradlew clean release -Prelease=true_, I don't get a zip 
file. In fact, I don't get jars either - I need to use _assemble_ to make them 
(both on master with and without upgrading Gradle). Am I using the wrong 
command? Does it work for you, [~matterhayes]?

> Upgrade Gradle to v4 or later
> -
>
> Key: DATAFU-125
> URL: https://issues.apache.org/jira/browse/DATAFU-125
> Project: DataFu
>  Issue Type: Task
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
> Attachments: DATAFU-125.patch
>
>
> We should update to the most recent version of Gradle.  We're currently using 
> 2.4.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (DATAFU-131) Update DataFu site to meet graduation requirements

2017-10-26 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil reassigned DATAFU-131:
---

Assignee: Matthew Hayes

> Update DataFu site to meet graduation requirements
> --
>
> Key: DATAFU-131
> URL: https://issues.apache.org/jira/browse/DATAFU-131
> Project: DataFu
>  Issue Type: Bug
>Reporter: Eyal Allweil
>Assignee: Matthew Hayes
> Attachments: DATAFU-131.patch, Screen Shot 2017-10-25 at 7.21.09 
> PM.png
>
>
> The following issues were raised with the [DataFu web 
> site|http://datafu.incubator.apache.org] as part of the [graduation 
> discussion on the incubator general mailing 
> list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E]
> There's no link to the main ASF website.
> There's no LICENSE or Thanks link.
> There's no download link.
> etc.
> The quick start guide pages do have download links, but the primary
> link is to Maven rather than the ASF, and there are no instructions as
> to how to check sigs or hashes, and no link to the KEYS file that I
> could find.
> The SHA-512 checksum must have the extension .sha512
> http://www.apache.org/dev/release-distribution.html#sigs-and-sums
> Also the latest release appears to be 1.3.2 (dated Feb 2017) but the
> download links point to 1.3.1.
> The older releases (1.3.1 and 1.3.0) should have been deleted from the
> release/dist directory by now.
> There's no Apache feather logo which is often used as the link to the
> main ASF site.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-17) Improve testing of randomized functions

2017-10-22 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-17?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214398#comment-16214398
 ] 

Eyal Allweil commented on DATAFU-17:


I think we can close this, just as we closed 
[DATAFU-28|https://issues.apache.org/jira/browse/DATAFU-28]. If all the tests 
take less than twenty minutes now, I don't think it's worth making an effort to 
minimize the randomized functions.

> Improve testing of randomized functions
> ---
>
> Key: DATAFU-17
> URL: https://issues.apache.org/jira/browse/DATAFU-17
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Will Vaughan
>
> We have a large number of UDFs with a random component that are difficult and 
> often slow to test.  We should improve our testing standards and capabilities 
> for this class of functions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0

2017-10-22 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214212#comment-16214212
 ] 

Eyal Allweil commented on DATAFU-48:


As an additional check, I ran a Pig script which uses 
_SimpleRandomSampleWithReplacementVote_ (which uses Guava) to see that it still 
runs correctly.

> Upgrade Guava to 17.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later

2017-10-22 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214208#comment-16214208
 ] 

Eyal Allweil commented on DATAFU-125:
-

_check_ and _clean release_ run and return SUCCESS. Are there any special files 
I should check that are the result of the _release_ task?

I also ran a script on the packaged jar (the regular one, not core or the 
jarjar) and it ran fine.

> Upgrade Gradle to v4 or later
> -
>
> Key: DATAFU-125
> URL: https://issues.apache.org/jira/browse/DATAFU-125
> Project: DataFu
>  Issue Type: Task
>Reporter: Matthew Hayes
> Attachments: DATAFU-125.patch
>
>
> We should update to the most recent version of Gradle.  We're currently using 
> 2.4.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-118) Automatically run rat task when running assemble

2017-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211505#comment-16211505
 ] 

Eyal Allweil commented on DATAFU-118:
-

(because we have a patch that seems to work on a newer Gradle version linked in 
the review board)

> Automatically run rat task when running assemble
> 
>
> Key: DATAFU-118
> URL: https://issues.apache.org/jira/browse/DATAFU-118
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
>
> The rat task checks that our files have the right headers.  We don't 
> automatically run it for assemble so it isn't easy for new contributors to 
> catch issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-32) Hourglass concrete jobs should have getters and setters for output name and namespace

2017-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210941#comment-16210941
 ] 

Eyal Allweil commented on DATAFU-32:


Is this still relevant? If so, I'll open a [Help Wanted 
task|https://helpwanted.apache.org/] for it.

> Hourglass concrete jobs should have getters and setters for output name and 
> namespace
> -
>
> Key: DATAFU-32
> URL: https://issues.apache.org/jira/browse/DATAFU-32
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Matthew Hayes
>
> With the abstract versions you can override getOutputSchemaName() and 
> getOutputSchemaNamespace().  But the concrete versions don't expose setters, 
> so you have to extend the class to override the defaults.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0

2017-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210769#comment-16210769
 ] 

Eyal Allweil commented on DATAFU-48:


None, actually. Hadoop 1 and 2 are using 11.0.2, like us. Hadoop 3 is [using 
21|https://issues.apache.org/jira/browse/HADOOP-10101].

> Upgrade Guava to 17.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-131) Update DataFu site to meet graduation requirements

2017-10-10 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199209#comment-16199209
 ] 

Eyal Allweil commented on DATAFU-131:
-

Here's a link to the Apache site guidelines:

https://www.apache.org/foundation/marks/pmcs#navigation

> Update DataFu site to meet graduation requirements
> --
>
> Key: DATAFU-131
> URL: https://issues.apache.org/jira/browse/DATAFU-131
> Project: DataFu
>  Issue Type: Bug
>Reporter: Eyal Allweil
>
> The following issues were raised with the [DataFu web 
> site|http://datafu.incubator.apache.org] as part of the [graduation 
> discussion on the incubator general mailing 
> list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E]
> There's no link to the main ASF website.
> There's no LICENSE or Thanks link.
> There's no download link.
> etc.
> The quick start guide pages do have download links, but the primary
> link is to Maven rather than the ASF, and there are no instructions as
> to how to check sigs or hashes, and no link to the KEYS file that I
> could find.
> The SHA-512 checksum must have the extension .sha512
> http://www.apache.org/dev/release-distribution.html#sigs-and-sums
> Also the latest release appears to be 1.3.2 (dated Feb 2017) but the
> download links point to 1.3.1.
> The older releases (1.3.1 and 1.3.0) should have been deleted from the
> release/dist directory by now.
> There's no Apache feather logo which is often used as the link to the
> main ASF site.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DATAFU-131) Update DataFu site to meet graduation requirements

2017-10-10 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-131:
---

 Summary: Update DataFu site to meet graduation requirements
 Key: DATAFU-131
 URL: https://issues.apache.org/jira/browse/DATAFU-131
 Project: DataFu
  Issue Type: Bug
Reporter: Eyal Allweil


The following issues were raised with the [DataFu web 
site|http://datafu.incubator.apache.org] as part of the [graduation discussion 
on the incubator general mailing 
list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E]

There's no link to the main ASF website.
There's no LICENSE or Thanks link.
There's no download link.
etc.

The quick start guide pages do have download links, but the primary
link is to Maven rather than the ASF, and there are no instructions as
to how to check sigs or hashes, and no link to the KEYS file that I
could find.

The SHA-512 checksum must have the extension .sha512

http://www.apache.org/dev/release-distribution.html#sigs-and-sums

Also the latest release appears to be 1.3.2 (dated Feb 2017) but the
download links point to 1.3.1.

The older releases (1.3.1 and 1.3.0) should have been deleted from the
release/dist directory by now.

There's no Apache feather logo which is often used as the link to the
main ASF site.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-87) Edit distance

2017-10-09 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197120#comment-16197120
 ] 

Eyal Allweil commented on DATAFU-87:


On second thought, since this UDF is now available in Hive, and since 
Levenshtein distance is a purely local computation, I'm guessing there's no 
need for a specific DataFu implementation. Shall we close this issue?

Here are some links to the Hive UDF.

https://issues.apache.org/jira/browse/HIVE-9556

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
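
For anyone who needs it from Pig in the meantime, a rough sketch of calling the 
Hive UDF through Pig's HiveUDF builtin; this assumes Pig 0.15 or later with the 
Hive libraries on the classpath, and the relation and field names are placeholders:

{noformat}
DEFINE Levenshtein org.apache.pig.builtin.HiveUDF('levenshtein');

pairs = LOAD 'input' AS (a:chararray, b:chararray);
dists = FOREACH pairs GENERATE a, b, Levenshtein(a, b) AS dist;
{noformat}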



> Edit distance
> -
>
> Key: DATAFU-87
> URL: https://issues.apache.org/jira/browse/DATAFU-87
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Joydeep Banerjee
> Attachments: DATAFU-87.patch
>
>
> [This is work-in-progress]
> Given 2 strings, provide a measure of dis-similarity (Levenshtein distance) 
> between them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-48) Upgrade Guava to 17.0

2017-10-08 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-48:
---
Attachment: DATAFU-48-update-gradle-to-20.0.patch

I checked, and Guava 20.0 is the last version that we can update to without 
getting into a Java version conflict. So this is a patch that updates Guava to 
20.0.

The tests all pass (build plugin, hourglass, and pig) and I ran a simple Pig 
script that uses the generated DataFu pig jar to see that it's still valid.

Let's close this ancient ticket!



> Upgrade Guava to 17.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL

2017-10-08 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196058#comment-16196058
 ] 

Eyal Allweil commented on DATAFU-12:


[~matterhayes], anyone, what do you think? I wouldn't "waste" our time on 
something that can already be done in Pig via Hive, and I'd like to close 
Jira issues that are no longer relevant.

> Implement Lead UDF based on version from SQL
> 
>
> Key: DATAFU-12
> URL: https://issues.apache.org/jira/browse/DATAFU-12
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Matthew Hayes
>
> Min Zhou has provided this suggestion ([Issue #88 on 
> GitHub|https://github.com/linkedin/datafu/pull/88]):
> Lead is an analytic function like Oracle's Lead function. It provides access 
> to more than one tuple of a bag at the same time without a self join. Given a 
> bag of tuples returned from a query, LEAD provides access to a tuple at a 
> given physical offset beyond the current position. It generates pairs of all 
> items in a bag.
> If you do not specify an offset, the default is 1. Null is returned if the 
> offset goes beyond the scope of the bag.
> Example 1:
> {noformat}
>register ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead('2');
>-- INPUT: ({(1),(2),(3),(4)})
>data = LOAD 'input' AS (data: bag {T: tuple(v:INT)});
>describe data;
>-- OUTPUT:  ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)})
>-- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: 
> int),elem2: (v: int))}}
>data2 = FOREACH data GENERATE Lead(data);
>describe data2;
>DUMP data2;
> {noformat}
> Example 2
> {noformat}
>register  ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead();
>-- INPUT: 
> ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})})
>data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: 
> tuple(v2:INT)})});
>--describe data;
>-- OUTPUT: 
> ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)})
>data2 = FOREACH data GENERATE Lead(data);
>--describe data2;
>DUMP data2;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-09-19 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-130:

Description: 
In our 
[guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
macro is described for making a three-way left outer join conveniently. We can 
add this macro to DataFu to make it even easier to use.

The macro's code is as follows:


{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
$key3;
  $joined = FOREACH cogrouped GENERATE
FLATTEN($relation1),
FLATTEN(EmptyBagToNullFields($relation2)),
FLATTEN(EmptyBagToNullFields($relation3));
}

{noformat}

(we would obviously want to add a test for this, too)



  was:
In our 
[guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
macro is described for making a three-way left outer join conveniently. We can 
add this macro to DataFu to make it even easier to use.

The macro's code is as follows:


{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
$key3;
  $joined = FOREACH cogrouped GENERATE
FLATTEN($relation1),
FLATTEN(EmptyBagToNullFields($relation2)),
FLATTEN(EmptyBagToNullFields($relation3));
}

(we would obviously want to add a test for this, too)

{noformat}




> Add left outer join macro described in the DataFu guide
> ---
>
> Key: DATAFU-130
> URL: https://issues.apache.org/jira/browse/DATAFU-130
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>  Labels: macro, newbie
>
> In our 
> [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
> macro is described for making a three-way left outer join conveniently. We 
> can add this macro to DataFu to make it even easier to use.
> The macro's code is as follows:
> {noformat}
> DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
> returns joined {
>   cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
> $key3;
>   $joined = FOREACH cogrouped GENERATE
> FLATTEN($relation1),
> FLATTEN(EmptyBagToNullFields($relation2)),
> FLATTEN(EmptyBagToNullFields($relation3));
> }
> {noformat}
> (we would obviously want to add a test for this, too)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-09-17 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169225#comment-16169225
 ] 

Eyal Allweil commented on DATAFU-130:
-

I think this is a good Jira issue to put in the [Apache Help Wanted 
site|https://helpwanted.apache.org/]. If there's no objection, I'll add it 
there.

> Add left outer join macro described in the DataFu guide
> ---
>
> Key: DATAFU-130
> URL: https://issues.apache.org/jira/browse/DATAFU-130
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>  Labels: macro, newbie
>
> In our 
> [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
> macro is described for making a three-way left outer join conveniently. We 
> can add this macro to DataFu to make it even easier to use.
> The macro's code is as follows:
> {noformat}
> DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
> returns joined {
>   cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
> $key3;
>   $joined = FOREACH cogrouped GENERATE
> FLATTEN($relation1),
> FLATTEN(EmptyBagToNullFields($relation2)),
> FLATTEN(EmptyBagToNullFields($relation3));
> }
> (we would obviously want to add a test for this, too)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-61.

Resolution: Fixed
  Assignee: Eyal Allweil

Merged.

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>Assignee: Eyal Allweil
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165991#comment-16165991
 ] 

Eyal Allweil commented on DATAFU-61:


Yes, I'll merge it.

I did respond to an open issue in the review request that I only just noticed, 
something about using COUNT vs. SUM when calculating the IDF part ... as far as 
I can tell, the existing code is OK but it wouldn't hurt if you or Russell want 
to take a look at it.

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-09-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165925#comment-16165925
 ] 

Eyal Allweil commented on DATAFU-119:
-

The documentation can be part of 
[DATAFU-128|https://issues.apache.org/jira/browse/DATAFU-128].

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
> Attachments: DATAFU-119-2.patch
>
>
> A UDF that, given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
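
For illustration, a minimal sketch of invoking the macro above to compare two runs; 
the relation names, the id key and the created_time ignored field are placeholders:

{noformat}
old_run = LOAD 'output_v1' AS (id:long, name:chararray, created_time:long);
new_run = LOAD 'output_v2' AS (id:long, name:chararray, created_time:long);

diffs = diff_macro(old_run, new_run, id, created_time);
STORE diffs INTO 'regression_diffs';
{noformat}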



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-119) New UDF - TupleDiff

2017-09-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-119:

Attachment: DATAFU-119-2.patch

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
> Attachments: DATAFU-119-2.patch
>
>
> A UDF that, given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164373#comment-16164373
 ] 

Eyal Allweil commented on DATAFU-61:


One last thing - I noticed after I uploaded my patch that it has my email, but 
I think it would be better for it to have your email, [~russell.jurney], since 
all I did was write the test. Is it OK if I replace my email with yours 
before committing this, so we get a (more accurate) "eyal committed with 
russell" type commit?

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-130:
---

 Summary: Add left outer join macro described in the DataFu guide
 Key: DATAFU-130
 URL: https://issues.apache.org/jira/browse/DATAFU-130
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil


In our 
[guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
macro is described for making a three-way left outer join conveniently. We can 
add this macro to DataFu to make it even easier to use.

The macro's code is as follows:


{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
$key3;
  $joined = FOREACH cogrouped GENERATE
FLATTEN($relation1),
FLATTEN(EmptyBagToNullFields($relation2)),
FLATTEN(EmptyBagToNullFields($relation3));
}

(we would obviously want to add a test for this, too)

{noformat}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-128) Add documentation for macros

2017-09-12 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162936#comment-16162936
 ] 

Eyal Allweil commented on DATAFU-128:
-

Is the documentation for updating the website accurate? There are references to 
svn in there, which lead me to think they might not be relevant anymore ...

> Add documentation for macros
> 
>
> Key: DATAFU-128
> URL: https://issues.apache.org/jira/browse/DATAFU-128
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Eyal Allweil
>
> Now that it is possible to add Pig macros to DataFu, we should update the 
> documentation to reflect this, and provide guidelines and point would-be 
> contributors to examples.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-129) New macro - dedup

2017-09-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-129:

Attachment: DATAFU-129.patch

Macro and test

> New macro - dedup
> -
>
> Key: DATAFU-129
> URL: https://issues.apache.org/jira/browse/DATAFU-129
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>  Labels: macro
> Attachments: DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
> ordering (typically a date updated field).
> One thing to consider - the implementation relies on the 
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
> dependencies in order for the test to run. While I feel that anyone using Pig 
> typically has PiggyBank in the classpath, this might not be true - do we have 
> an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - relation to dedup
> row_key - field(s) for group by
> order_field - the field for ordering (to find the most recent record)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DATAFU-129) New macro - dedup

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-129:
---

 Summary: New macro - dedup
 Key: DATAFU-129
 URL: https://issues.apache.org/jira/browse/DATAFU-129
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil
Assignee: Eyal Allweil


Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
ordering (typically a date updated field).

One thing to consider - the implementation relies on the 
ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
dependencies in order for the test to run. While I feel that anyone using Pig 
typically has PiggyBank in the classpath, this might not be true - do we have 
an alternative? (maybe adding it to the jarjar?)

The macro's definition looks as follows:

DEFINE dedup(relation, row_key, order_field) returns out {

relation - relation to dedup
row_key - field(s) for group by
order_field - the field for ordering (to find the most recent record)
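For illustration only, a minimal sketch of the same idea in plain Pig, using a 
nested ORDER and LIMIT instead of ExtremalTupleByNthField (this is not the 
attached patch, and it assumes a single key field):

{noformat}
DEFINE dedup_sketch(relation, row_key, order_field) RETURNS out {
  grouped = GROUP $relation BY $row_key;
  $out = FOREACH grouped {
    -- keep only the most recent record for each key
    sorted = ORDER $relation BY $order_field DESC;
    latest = LIMIT sorted 1;
    GENERATE FLATTEN(latest);
  };
};
{noformat}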




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DATAFU-128) Add documentation for macros

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-128:
---

 Summary: Add documentation for macros
 Key: DATAFU-128
 URL: https://issues.apache.org/jira/browse/DATAFU-128
 Project: DataFu
  Issue Type: Improvement
Reporter: Eyal Allweil


Now that it is possible to add Pig macros to DataFu, we should update the 
documentation to reflect this, provide guidelines, and point would-be 
contributors to examples.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-127) New macro - sample by keys

2017-09-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-127:

Attachment: DATAFU-127.patch

Patch including new macros and tests

> New macro - sample by keys
> --
>
> Key: DATAFU-127
> URL: https://issues.apache.org/jira/browse/DATAFU-127
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>  Labels: macro
> Attachments: DATAFU-127.patch
>
>
> Two macros that return a sample of a larger table based on a list of keys, 
> with the schema of the larger table. One of the macros filters by dates, the 
> other doesn't.
> If there are multiple rows with a key that appears in the key list, all of 
> them will be returned (no deduplication is done). The results are returned 
> ordered by the key field in a single file.
> The implementation uses a replicated join for efficiency, but this means the 
> key list shouldn't be too large as to not fit in memory.
> The first macro's definition looks as follows:
> DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) 
> returns out {
> - table_name  - table name to sample
> - sample_set  - a set of keys
> - join_key_table  - join column name in the table
> - join_key_sample - join column name in the sample



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DATAFU-127) New macro - sample by keys

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-127:
---

 Summary: New macro - sample by keys
 Key: DATAFU-127
 URL: https://issues.apache.org/jira/browse/DATAFU-127
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil
Assignee: Eyal Allweil


Two macros that return a sample of a larger table based on a list of keys, with 
the schema of the larger table. One of the macros filters by dates, the other 
doesn't.

If there are multiple rows with a key that appears in the key list, all of them 
will be returned (no deduplication is done). The results are returned ordered 
by the key field in a single file.

The implementation uses a replicated join for efficiency, but this means the 
key list shouldn't be too large as to not fit in memory.

The first macro's definition looks as follows:

DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) 
returns out {

- table           - the table (relation) to sample
- sample_set      - a set of keys
- join_key_table  - join column name in the table
- join_key_sample - join column name in the sample
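For illustration only, a minimal sketch of the idea (not the attached patch; it 
assumes single, differently named join keys and omits projecting back to the 
table's schema):

{noformat}
DEFINE sample_by_keys_sketch(table, sample_set, join_key_table, join_key_sample) RETURNS out {
  -- 'replicated' loads the right-most relation into memory, which is why
  -- the key list must be small enough to fit
  joined = JOIN $table BY $join_key_table, $sample_set BY $join_key_sample USING 'replicated';
  -- ORDER ... PARALLEL 1 returns the rows sorted by the key in a single file
  $out = ORDER joined BY $join_key_table PARALLEL 1;
};
{noformat}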





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (DATAFU-126) There is a typo in document

2017-09-11 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-126.
-
Resolution: Fixed

> There is a typo in document
> ---
>
> Key: DATAFU-126
> URL: https://issues.apache.org/jira/browse/DATAFU-126
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Kane
>Assignee: Eyal Allweil
>Priority: Minor
>
> It should be "functions" not "functiosn" in the document page 
> https://datafu.incubator.apache.org/docs/datafu/guide.html.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-126) There is a typo in document

2017-09-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161252#comment-16161252
 ] 

Eyal Allweil commented on DATAFU-126:
-

Thanks Kane! I've fixed this in our sources, and it will show up when we 
release our next version.

> There is a typo in document
> ---
>
> Key: DATAFU-126
> URL: https://issues.apache.org/jira/browse/DATAFU-126
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Kane
>Assignee: Eyal Allweil
>Priority: Minor
>
> It should be "functions" not "functiosn" in the document page 
> https://datafu.incubator.apache.org/docs/datafu/guide.html.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (DATAFU-126) There is a typo in document

2017-09-11 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil reassigned DATAFU-126:
---

Assignee: Eyal Allweil

> There is a typo in document
> ---
>
> Key: DATAFU-126
> URL: https://issues.apache.org/jira/browse/DATAFU-126
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Kane
>Assignee: Eyal Allweil
>Priority: Minor
>
> It should be "functions" not "functiosn" in the document page 
> https://datafu.incubator.apache.org/docs/datafu/guide.html.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible

2017-09-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161211#comment-16161211
 ] 

Eyal Allweil commented on DATAFU-83:


By the way, [~ItsAUsernameRight?], if you're already looking at InUDF, and 
you'd like another contribution afterwards, you can also look at 
[DATAFU-80|https://issues.apache.org/jira/browse/DATAFU-80] - it's another 
small change to improve InUDF's behavior. (you can ignore the second part of 
that issue, which deals with Java versions).


> InUDF does not validate that types are compatible
> -
>
> Key: DATAFU-83
> URL: https://issues.apache.org/jira/browse/DATAFU-83
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
> Attachments: DATAFU-83.patch, rb36702.patch
>
>
> See the example below.  The input data is a long, but ints are provided to 
> match against.  Because it uses the Java equals to compare and these are 
> different types, this will never match, which can lead to confusing results.  
> I believe it should at least throw an error.
> {code}
>   define I datafu.pig.util.InUDF();
>   
>   data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)});
>   
>   data2 = FOREACH data {
> C = FILTER B By I(v, 1,2,3);
> GENERATE C;
>   }
>   
>   describe data2;
>   
>   STORE data2 INTO 'output';
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161118#comment-16161118
 ] 

Eyal Allweil commented on DATAFU-61:


Came back to this today and tried a little experiment - I verified (calculating 
manually) that Russell's code produces the same results as the "augmented TF" 
IDF flavor for the sample I took from the Wikipedia page. Is that good 
enough for us?
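For reference, the "augmented TF" variant described on that Wikipedia page 
scales the raw term count by the count of the most frequent term in the 
document (the standard idf is shown alongside):

{noformat}
tf(t, d)  = 0.5 + 0.5 * f(t, d) / max { f(t', d) : t' in d }
idf(t, D) = log( N / |{ d in D : t in d }| )
{noformat}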

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-08-06 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-61:
---
Attachment: DATAFU-61-2.patch

Now that macros are supported (and can be tested), I updated this patch. 
Unfortunately, I couldn't find the sample data, so I just pulled the sample 
sentences from the Wikipedia page for TF-IDF, and I didn't verify that the 
results are OK. [~russell.jurney] - want to donate a test case and expected 
results?

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-08-06 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115744#comment-16115744
 ] 

Eyal Allweil commented on DATAFU-119:
-

[~matterhayes] - We want the Apache license header on our macro files too, 
right? If so, I'll add it to the sample macro from 
[DATAFU-123|https://issues.apache.org/jira/browse/DATAFU-123] as well.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods

2017-06-29 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067884#comment-16067884
 ] 

Eyal Allweil commented on DATAFU-124:
-

I reviewed it - looks fine, a nice improvement. I'll try to get it committed 
soon (unless of course someone has any actionable comments)

> sessionize() ought to support millisecond periods
> -
>
> Key: DATAFU-124
> URL: https://issues.apache.org/jira/browse/DATAFU-124
> Project: DataFu
>  Issue Type: Bug
>Reporter: Jacob Tolar
>
> The sessionize UDF should support a period in milliseconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL

2017-04-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972397#comment-15972397
 ] 

Eyal Allweil commented on DATAFU-12:


It looks like this functionality is implemented in Hive - see the following two 
links:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-LEADusingdefault1rowleadandnotspecifyingdefaultvalue

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLead.java

Since Pig now supports using Hive UDFs, I think this Jira can be closed. 
Alternatively, if we want to provide a DataFu implementation, I'll copy the 
proposed patch and discussion from the GitHub issue mentioned in the 
description, so it's easier for a potential implementer to continue where the 
work stalled.

> Implement Lead UDF based on version from SQL
> 
>
> Key: DATAFU-12
> URL: https://issues.apache.org/jira/browse/DATAFU-12
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Matthew Hayes
>
> Min Zhou has provided this suggestion ([Issue #88 on 
> GitHub|https://github.com/linkedin/datafu/pull/88]):
> Lead is an analytic function like Oracle's Lead function. It provides access 
> to more than one tuple of a bag at the same time without a self join. Given a 
> bag of tuple returned from a query, LEAD provides access to a tuple at a 
> given physical offset beyond that position. Generates pairs of all items in a 
> bag.
> If you do not specify offset, then its default is 1. Null is returned if the 
> offset goes beyond the scope of the bag.
> Example 1:
> {noformat}
>register ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead('2');
>-- INPUT: ({(1),(2),(3),(4)})
>data = LOAD 'input' AS (data: bag {T: tuple(v:INT)});
>describe data;
>-- OUTPUT:  ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)})
>-- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: 
> int),elem2: (v: int))}}
>data2 = FOREACH data GENERATE Lead(data);
>describe data2;
>DUMP data2;
> {noformat}
> Example 2
> {noformat}
>register  ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead();
>-- INPUT: 
> ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})})
>data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: 
> tuple(v2:INT)})});
>--describe data;
>-- OUPUT: 
> ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)})
>data2 = FOREACH data GENERATE Lead(data);
>--describe data2;
>DUMP data2;
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (DATAFU-123) Allow DataFu to include macros

2017-03-08 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-123:

Attachment: DATAFU-123.patch

The change ended up being smaller than what I originally described - all I did 
was add the "pig.import.search.path" property with the value of the 
src/main/resources directory to PigTests.

This means that any macro files that are put there can be tested, both in 
Gradle and Eclipse. I put some sample counting macros there and a test for them.

In general, any macro file placed in src/main/resources can be used by 
registering the DataFu jar.

If we include this patch, we should update the Contributing page so that 
instructions for contributing Pig macros are easy to find and understand.
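For illustration only, the kind of counting macro that could live in 
src/main/resources might look like this (hypothetical; the actual sample macros 
in the patch may differ):

{noformat}
DEFINE count_all_rows(relation) RETURNS cnt {
  grouped = GROUP $relation ALL;
  $cnt = FOREACH grouped GENERATE COUNT_STAR($relation) AS row_count;
};
{noformat}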

> Allow DataFu to include macros 
> ---
>
> Key: DATAFU-123
> URL: https://issues.apache.org/jira/browse/DATAFU-123
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>  Labels: testability
> Attachments: DATAFU-123.patch
>
>
> A few changes to allow macros to be contributed to DataFu. If a macro file is 
> placed in src/main/resources, it can be used by registering the DataFu jar. 
> Such macros can then be tested both from within Eclipse and Gradle.
> There are three small parts:
> 1) All unit tests that use createPigTest methods will automatically register 
> the DataFu jar.
> 2) Some changes to fix the PigTests.getJarPath() functionality, which doesn't 
> appear to work. (these changes are aligned with the proposed patch for 
> [DATAFU-106|https://issues.apache.org/jira/browse/DATAFU-106])
> 3) A sample macro and test
> The changes here will allow moving forward with 
> [DATAFU-61|https://issues.apache.org/jira/browse/DATAFU-61] and including the 
> macro I suggested for 
> [DATAFU-119|https://issues.apache.org/jira/browse/DATAFU-119]. (I also have 
> additional content in mind)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DATAFU-106) Test files should be created in a subfolder of projects

2017-01-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817823#comment-15817823
 ] 

Eyal Allweil commented on DATAFU-106:
-

[~takias], I will try to sort our Jira issues out and mark those that are 
easier to begin with. Have you worked on Pig UDFs before?

Piyush - I will try to finish our review as soon as I can!

> Test files should be created in a subfolder of projects
> ---
>
> Key: DATAFU-106
> URL: https://issues.apache.org/jira/browse/DATAFU-106
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
> Fix For: 1.3.1
>
>
> Test files are currently created in the subdirectory folder (e.g. 
> datafu-pig/input*).  For better organization, we should create them in a 
> subdirectory.  This also makes it easier to exclude them all with gitignore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-01-02 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793097#comment-15793097
 ] 

Eyal Allweil commented on DATAFU-119:
-

If we add DATAFU-123, we can include the macro I put in the description so that 
people can use it instead of duplicating it in order to conveniently call the 
UDF.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-87) Edit distance

2016-10-25 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606106#comment-15606106
 ] 

Eyal Allweil commented on DATAFU-87:


Hi Joydeep,

I want to begin by apologizing for the time it's taken us to get to your 
contribution. Did you ever continue with it? Have you compared your 
implementation with [the one in Apache Commons 
Text|https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java]
 or [Commons 
Lang|https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L7731]?
 (I think they follow the same algorithm, from _Algorithms on Strings, Trees 
and Sequences_ by Dan Gusfield and Chas Emerick)

> Edit distance
> -
>
> Key: DATAFU-87
> URL: https://issues.apache.org/jira/browse/DATAFU-87
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Joydeep Banerjee
> Attachments: DATAFU-87.patch
>
>
> [This is work-in-progress]
> Given 2 strings, provide a measure of dis-similarity (Levenshtein distance) 
> between them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting

2016-10-25 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605952#comment-15605952
 ] 

Eyal Allweil commented on DATAFU-98:


Hi Russell.

First of all, I want to apologize for the time it's taken us to get to your 
contribution. I think it could be quite useful. Having said that, I wonder if 
the current version - without counters - gives us enough of an advantage over 
vanilla Pig. I think the following code (modified from your unit test) gives us 
nearly the same functionality as the UDF in the patch:

{noformat}
data_in = LOAD 'input' as (val:int);
-- data_in: "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "20"

intermediate_data = FOREACH data_in GENERATE val, (val / 5 * 5) AS binStart;

data_out = FOREACH (GROUP intermediate_data BY binStart) GENERATE group AS 
binStart, COUNT(intermediate_data) AS binCount;
-- data_out: (0,5),(5,5),(10,2),(20,1)

{noformat}

Unlike your UDF, missing bins are not included. But while including missing 
bins can be useful, I do wonder if a single skewed value can cause problems, 
especially with small bin sizes and long values. (as a performance-related 
aside, I would try to have FrequencyCounter.toBag() called only in the Final 
implementations, instead of the first two stages of the algebraic 
implementation, to minimize the data copied).

So it seems to me the current UDF has the advantage of having the missing bins, 
and it's obviously more readable and convenient than rewriting the Pig code I 
wrote above. Did you (or you, [~andrew.musselman]) run any performance tests? 
Maybe the Algebraic implementation runs faster than the vanilla Pig code by 
virtue of the combiner use.

Last (but not least!) the version you mentioned with counters sounds like it 
could be really great.


> New UDF for Histogram / Frequency counting
> --
>
> Key: DATAFU-98
> URL: https://issues.apache.org/jira/browse/DATAFU-98
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Russell Melick
> Attachments: DATAFU-98.patch
>
>
> I was thinking of creating a new UDF to compute histograms / frequency counts 
> of input bags.  It seems like it would make sense to support ints, longs, 
> float, and doubles.  
> I tried looking around to see if this was already implemented, but 
> ValueHistogram and AggregateWordHistogram were about the only things I found. 
>  They seem to exist as an example job, and only work for Strings.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
> Should the user specify the bin size or the number of bins?  Specifying bin 
> size probably makes the implementation simpler since you can bin things 
> without having seen all of the data.
> I think it would make sense to implement a version of this that didn't need 
> any reducers.  It could use counters to keep track of the counts per bin 
> without sending any data to a reducer.  You would be able to call this 
> without a preceding GROUP BY as well.
> Here's my proposal for the two udfs.  This assumes the input data is two 
> columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50')
> connections = LOAD 'connections' AS memberId, numConnections;
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE 
> BinnedFrequency(connections.numConnections);
> {code}
> The output here would be a bag with the frequency counts
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
> {code}
> DEFINE BinnedFrequencyCounter 
> datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram')
> connections = LOAD 'connections' AS memberId, numConnections;
> connections = FOREACH connections GENERATE 
> BinnedFrequencyCounter(numConnections);
> {code}
> The output here would just be a counter for each bin, all sharing the same 
> group of numConnectionsHistogram.  It would look something like
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589407#comment-15589407
 ] 

Eyal Allweil edited comment on DATAFU-25 at 10/19/16 6:01 PM:
--

This is a minimal fix that uses getInputSchema() instead of the udf context, 
when possible.

It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], 
though not the example provided there - BagLeftOuterJoin. This is because 
MonitoredUDF prevents the udf context from working (see 
[PIG-3554|https://issues.apache.org/jira/browse/PIG-3554]), and BagJoin uses 
the udf context for other things, not just those that AliasableEvalFunc 
provides.

I couldn't think of a clean way of adding a test for this, but you can verify 
that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - 
this will make the TransposeTest.transposeTest fail, unless my patch is used. I 
didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding 
it in the test package means that TransposeTest can't access it.


was (Author: eyal):
This is a minimal fix that uses getInputSchema() instead of the udf context, 
when possible.

It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], 
though not the example provided there - BagLeftOuterJoin. This is because 
MonitoredUDF prevents the udf context from working (see 
[PIG-3554|https://issues.apache.org/jira/browse/PIG-3554], and BagJoin uses the 
udf context for other things, not just those that AliasableEvalFunc provides.

I couldn't think of a clean way of adding a test for this, but you can verify 
that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - 
this will make the TransposeTest.transposeTest fail, unless my patch is used. I 
didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding 
it in the test package means that TransposeTest can't access it.

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Will Vaughan
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-19 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-25:
---
Attachment: DATAFU-25.patch

This is a minimal fix that uses getInputSchema() instead of the udf context, 
when possible.

It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], 
though not the example provided there - BagLeftOuterJoin. This is because 
MonitoredUDF prevents the udf context from working (see 
[PIG-3554|https://issues.apache.org/jira/browse/PIG-3554], and BagJoin uses the 
udf context for other things, not just those that AliasableEvalFunc provides.

I couldn't think of a clean way of adding a test for this, but you can verify 
that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - 
this will make the TransposeTest.transposeTest fail, unless my patch is used. I 
didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding 
it in the test package means that TransposeTest can't access it.

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Will Vaughan
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2016-10-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586085#comment-15586085
 ] 

Eyal Allweil commented on DATAFU-16:


It looks like this got added - can this issue be closed?

> weighted reservoir sampling with exponential jumps UDF
> --
>
> Key: DATAFU-16
> URL: https://issues.apache.org/jira/browse/DATAFU-16
> Project: DataFu
>  Issue Type: New Feature
> Environment: Mac, Linux
> pig-0.11
>Reporter: jian wang
>Assignee: jian wang
>Priority: Minor
> Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, 
> WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted 
> reservoir sampling algorithm with exponential jumps. Investigation is tracked 
> in  https://github.com/linkedin/datafu/issues/80. This task is part of 
> experiment of different weighted sampling algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-45) RFE: CartesianProduct

2016-10-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584898#comment-15584898
 ] 

Eyal Allweil commented on DATAFU-45:


Hi Sam,

Did you ever solve this? I agree with Matthew that this should be doable via 
plain Pig - if not, I'd open a bug there.

> RFE: CartesianProduct
> -
>
> Key: DATAFU-45
> URL: https://issues.apache.org/jira/browse/DATAFU-45
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Sam Steingold
>
> Given two bags, produce their [Cartesian 
> product|http://en.wikipedia.org/wiki/Cartesian_product]:
> {code}
> B1: bag{T1}
> B2: bag{T2}
> CartesianProduct(B1,B2): bag{(T1,T2)}
> {code}
> Use case:
> {code}
> toks = TOKENIZE((charray)$0,',');
> kwds = CartesianProduct(toks, {1.0/(double)SIZE(toks)});
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-65) Aho-Corasick Pig UDF

2016-10-18 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-65:
---
Issue Type: New Feature  (was: Bug)

> Aho-Corasick Pig UDF
> 
>
> Key: DATAFU-65
> URL: https://issues.apache.org/jira/browse/DATAFU-65
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
> Environment: Drought
>Reporter: Russell Jurney
> Attachments: DATAFU-65.diff
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> I need to use the Aho-Corasick algorithm for efficient sub-string matching. A 
> java implementation is available at 
> https://github.com/robert-bor/aho-corasick and is available on maven central: 
> http://maven-repository.com/artifact/org.arabidopsis.ahocorasick/ahocorasick/2.x
>  A Pig UDF will be very helpful to me.
> How do I add a maven dependency with gradle?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-28) Tests are too slow

2016-10-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571961#comment-15571961
 ] 

Eyal Allweil commented on DATAFU-28:


On my machine the datafu-pig tests run in 18 minutes (I ran them with ./gradlew 
:datafu-pig:test). Is this issue still relevant, or is that an acceptable time?

> Tests are too slow
> --
>
> Key: DATAFU-28
> URL: https://issues.apache.org/jira/browse/DATAFU-28
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>
> I ran the tests on my laptop and it took nearly 2 hours.
> The worst offenders are {{datafu.test.pig.sampling}}, 
> {{datafu.test.pig.stats}}, and {{datafu.test.pig.stats.entropy}}.
> ||Package  ||Tests||  Failures||  Duration||  Success rate||
> |datafu.test.pig.bags|27  |0| 1m10.72s|100%|
> |datafu.test.pig.geo  |1  |0  |9.757s |100%|
> |datafu.test.pig.hash|4   |0  |41.039s|   100%|
> |datafu.test.pig.linkanalysis|5   |0| 32.677s |100%|
> |datafu.test.pig.random   |1| 0|  11.789s|100%|
> |datafu.test.pig.sampling |25|0   |38m25.81s| 100%|
> |datafu.test.pig.sessions |7  |0  |2m50.67s   |100%|
> |datafu.test.pig.sets |9  |0  |5m46.70s   |100%|
> |datafu.test.pig.stats|   52| 0   |26m11.98s| 100%|
> |datafu.test.pig.stats.entropy|40|0   |31m30.97s  |100%|
> |datafu.test.pig.urls|1   |0  |1m35.24s   |100%|
> |datafu.test.pig.util|21  |0| 4m51.64s|100%|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-85) Add SPRINTF to provide this functionality to Pig < 0.14.0

2016-10-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571787#comment-15571787
 ] 

Eyal Allweil commented on DATAFU-85:


Given the time that has passed, and that it can't be backported (easily), I 
think this issue can/should be closed.

> Add SPRINTF to provide this functionality to Pig < 0.14.0
> -
>
> Key: DATAFU-85
> URL: https://issues.apache.org/jira/browse/DATAFU-85
> Project: DataFu
>  Issue Type: Bug
>Reporter: Russell Jurney
>Assignee: Russell Jurney
>
> I need SPRINTF in DataFu for a book I'm working on. I'd like to add this to 
> DataFu so that CDH, HDP, MapR, etc. users can use SPRINTF as soon as DataFu 
> cuts a new release.
> See PIG-3939
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-122) Documentation error/typo on tips and tricks involving Coalesce

2016-10-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-122:

 Assignee: Eyal Allweil
   Labels: documentation typo  (was: docuentation typo)
Fix Version/s: 1.3.2

Thanks Ryan! I've fixed this in our sources, and it will show up when we 
release our next version (probably 1.3.2)

> Documentation error/typo on tips and tricks involving Coalesce
> --
>
> Key: DATAFU-122
> URL: https://issues.apache.org/jira/browse/DATAFU-122
> Project: DataFu
>  Issue Type: Bug
>Reporter: Ryan Clough
>Assignee: Eyal Allweil
>Priority: Trivial
>  Labels: documentation, typo
> Fix For: 1.3.2
>
>
> http://datafu.incubator.apache.org/docs/datafu/guide/more-tips-and-tricks.html
> On this page, an example is given for Coalesce:
> {code}
> DEFINE EmptyBagToNullFields datafu.pig.util.Coalesce();
> data = FOREACH data GENERATE Coalesce(val,0) as result;
> {code}
> In this example, "EmptyBagToNullFields" should be replaced with "Coalesce", 
> which is what is used in the code following the define statement. My guess is 
> this is a copy-paste error from an example further down where 
> EmptyBagToNullFields is actually used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2016-09-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500764#comment-15500764
 ] 

Eyal Allweil commented on DATAFU-119:
-

I've run it on results that were in the tens of millions. I think the main 
reason for using it / including it in DataFu is that if you're developing Pig 
code and running it on a cluster (or any given environment), being able to 
stay in the Pig ecosystem is convenient for fast development cycles. If your 
original job can run on the given environment, a comparison job can run there 
efficiently, too. And there's less copying, because you leave the previous 
results in HDFS under a different name and compare easily.

The output is human-readable, but the expected result is that most records 
return null, because they're identical, and the ones that do come out are 
usually edge cases that turned out different.

That's the reasoning behind having "something" like this UDF. The output type 
and its lack of a schema are a different story - it would be better to have a 
schema. But I'm hesitant to spend the time on it if it isn't likely that 
someone else will want to write a different output format for it.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2016-09-07 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471164#comment-15471164
 ] 

Eyal Allweil commented on DATAFU-119:
-

Any feedback about this?

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2016-06-27 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350489#comment-15350489
 ] 

Eyal Allweil commented on DATAFU-119:
-

I put up a [reviewboard|https://reviews.apache.org/r/49248/] for this. After 
some internal discussions, I wonder if the output isn't too specific for 
general use. I find it very convenient during development for comparing 
outputs, but it's very much skewed towards human readability. To make the 
output easy to use in Pig, it should have a real schema, not a chararray - 
possibly something with the field names from the original tuples, but boolean 
or int values to indicate change types. I'd be happy to hear feedback about 
this.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-06-08 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117-4.patch

This patch incorporates the last remaining comment from the review (clearing 
instead of reassigning the set in cleanup)

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, 
> DATAFU-117-4.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-05-19 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117-3.patch

Incorporates changes from [review|https://reviews.apache.org/r/46701/]

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (DATAFU-117) New UDF - CountDistinctUpTo

2016-05-09 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258239#comment-15258239
 ] 

Eyal Allweil edited comment on DATAFU-117 at 5/9/16 8:50 AM:
-

Ok, I opened a review board for it - It's at https://reviews.apache.org/r/46701/

I think all your previous comments are addressed there, except for the one 
about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this 
can exceed the max size, because a single add operation can only increment the 
set's size by one, and the UDF is executed in a single thread.

I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT 
followed by the builtin COUNT. On small inputs they perform about the same - 
even up to a million records - but if you have a situation with more skew (I 
checked 10 million records, with about 4 million distincts) then this UDF with 
a max value of say, 1,000,000, runs in a few minutes, and the nested 
foreach+distinct+count takes more than an hour - probably because it needs to 
keep all the distinct records in memory, rather than just reaching the desired 
threshold.
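For reference, a minimal sketch of the plain-Pig baseline referred to above 
(relation and field names are illustrative):

{noformat}
grouped = GROUP data BY user_id;
counted = FOREACH grouped {
  -- DISTINCT in a nested FOREACH keeps all distinct values of the group in
  -- memory, which is what makes the highly distinct case slow
  distinct_vals = DISTINCT data.item;
  GENERATE group AS user_id, COUNT(distinct_vals) AS distinct_items;
};
{noformat}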


was (Author: eyal):
Ok, I opened a review board for it - can you see it? It's at 
https://reviews.apache.org/r/46701/

I think all your previous comments are addressed there, except for the one 
about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this 
can exceed the max size, because a single add operation can only increment the 
set's size by one, and the UDF is executed in a single thread.

I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT 
followed by the builtin COUNT. On small inputs they perform about the same - 
even up to a million records - but if you have a situation with more skew (I 
checked 10 million records, with about 4 million distincts) then this UDF with 
a max value of say, 1, runs in about four minutes, and the nested 
foreach+distinct+count takes more than an hour - probably because it needs to 
keep all the distinct records in memory, rather than just reaching the desired 
threshold.

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-04-26 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117-2.patch

This replaces the previous patch file, addresses (most of) Matthew's comments, 
and adds an Algebraic implementation to the UDF.
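
For readers who haven't opened the patch, this is roughly the shape an Algebraic 
version takes: the UDF exposes three EvalFuncs (Initial, Intermed, Final) that 
Pig can run map-side, in the combiner, and in the reducer. The code below is a 
sketch only - the class name, the hard-coded limit, and the merge helper are 
assumptions made here for illustration; the attached DATAFU-117-2.patch is the 
authoritative version.

{code:java}
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class CountDistinctUpToSketch extends EvalFunc<Integer> implements Algebraic {

  private static final TupleFactory TF = TupleFactory.getInstance();
  private static final BagFactory BF = BagFactory.getInstance();

  // Assumed cap, hard-coded for the sketch; the real UDF takes it as a parameter.
  private static final int MAX = 1000;

  // Merge the partial bags produced by Initial/Intermed into one distinct set,
  // never keeping more than MAX tuples.
  private static Set<Tuple> merge(Tuple input) throws IOException {
    Set<Tuple> seen = new HashSet<Tuple>();
    for (Tuple partial : (DataBag) input.get(0)) {
      for (Tuple row : (DataBag) partial.get(0)) {
        if (seen.add(row) && seen.size() == MAX) {
          return seen; // cap reached - no point in carrying more rows around
        }
      }
    }
    return seen;
  }

  // Non-combined path: count distinct tuples of the bag directly, up to MAX.
  @Override
  public Integer exec(Tuple input) throws IOException {
    Set<Tuple> seen = new HashSet<Tuple>();
    for (Tuple row : (DataBag) input.get(0)) {
      if (seen.add(row) && seen.size() == MAX) {
        break;
      }
    }
    return seen.size();
  }

  @Override public String getInitial() { return Initial.class.getName(); }
  @Override public String getIntermed() { return Intermed.class.getName(); }
  @Override public String getFinal() { return Final.class.getName(); }

  // Map side: wrap the single incoming tuple in a bag so it can be merged later.
  public static class Initial extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
      DataBag out = BF.newDefaultBag();
      Iterator<Tuple> it = ((DataBag) input.get(0)).iterator();
      if (it.hasNext()) {
        out.add(it.next());
      }
      return TF.newTuple(out);
    }
  }

  // Combiner: union the partial distinct sets, still capped at MAX.
  public static class Intermed extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
      DataBag out = BF.newDefaultBag();
      for (Tuple row : merge(input)) {
        out.add(row);
      }
      return TF.newTuple(out);
    }
  }

  // Reducer: merge what is left and emit the capped distinct count.
  public static class Final extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
      return merge(input).size();
    }
  }
}
{code}

The important property is that the Intermed step never forwards more than the 
limit, so the combiner keeps the data volume bounded.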

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough, even though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-29 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215634#comment-15215634
 ] 

Eyal Allweil commented on DATAFU-115:
-

Thanks!

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used, with a smaller memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-27 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213559#comment-15213559
 ] 

Eyal Allweil commented on DATAFU-115:
-

I'm not sure why, but I can't see this patch in the master branch. I can see 
https://issues.apache.org/jira/browse/DATAFU-114 - 
[FirstTupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/FirstTupleFromBag.java]
 appears changed - but 
[TupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/TupleFromBag.java]
 looks like it hasn't been changed since August. Does the public GitHub mirror 
accurately reflect the repository?

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used, with a smaller memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-03-24 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117.patch

Patch including new UDF and test (in BagTests)

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough, even though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DATAFU-117) New UDF - CountDistinctUpTo

2016-03-24 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-117:
---

 Summary: New UDF - CountDistinctUpTo
 Key: DATAFU-117
 URL: https://issues.apache.org/jira/browse/DATAFU-117
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil


A UDF that counts distinct tuples within a bag, but only up to a preset limit. 
If the bag contains more distinct tuples than the limit, the UDF returns the 
limit. 

This UDF can run reasonably well even on large bags if the limit chosen is 
small enough, even though the count is done in memory.

We use this UDF in PayPal for filtering, when we don't need to use the actual 
tuples afterward.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-08 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185409#comment-15185409
 ] 

Eyal Allweil commented on DATAFU-116:
-

As far as I can tell, when the accumulator is used, Pig passes 
_pig.accumulative.batchsize_ tuples from each bag until all the tuples are 
exhausted. I think an implementation that iterates over the bags and only keeps 
some of the tuples in between batches is possible - hopefully very few, but the 
worst case is all of them, which is no worse than the current implementation.

I'm assuming Pig passes batches in this way based on the code in 
[POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java]
 and from looking through all the documentation I could find on accumulators. 
If I'm wrong, it does mean that an accumulator implementation isn't worthwhile.
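
To make the carry-over idea concrete, here is a rough sketch for a two-bag 
sorted intersection. This is not DataFu's SetIntersect: the class name is made 
up, duplicate handling and the variable number of input bags are ignored, and 
it assumes Pig hands accumulate() a partial chunk of each bag per call, as 
described above.

{code:java}
import java.io.IOException;
import java.util.LinkedList;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class SortedIntersectSketch extends EvalFunc<DataBag> implements Accumulator<DataBag> {

  private final LinkedList<Tuple> leftoverA = new LinkedList<Tuple>();
  private final LinkedList<Tuple> leftoverB = new LinkedList<Tuple>();
  private DataBag result = BagFactory.getInstance().newDefaultBag();

  @Override
  public void accumulate(Tuple input) throws IOException {
    // Each call hands us the next pig.accumulative.batchsize tuples of each bag.
    for (Tuple t : (DataBag) input.get(0)) { leftoverA.add(t); }
    for (Tuple t : (DataBag) input.get(1)) { leftoverB.add(t); }

    // Merge-walk the sorted prefixes; anything that can be ruled out is dropped,
    // so only the unmatched tail of (at most) one side is kept between batches.
    while (!leftoverA.isEmpty() && !leftoverB.isEmpty()) {
      int c = leftoverA.getFirst().compareTo(leftoverB.getFirst());
      if (c == 0) {
        result.add(leftoverA.removeFirst());
        leftoverB.removeFirst();
      } else if (c < 0) {
        leftoverA.removeFirst(); // smaller than anything still to come in B
      } else {
        leftoverB.removeFirst(); // smaller than anything still to come in A
      }
    }
  }

  @Override
  public DataBag getValue() {
    return result;
  }

  @Override
  public void cleanup() {
    leftoverA.clear();
    leftoverB.clear();
    result = BagFactory.getInstance().newDefaultBag();
  }

  @Override
  public DataBag exec(Tuple input) throws IOException {
    // Non-accumulating path: the same logic over the whole bags at once.
    cleanup();
    accumulate(input);
    DataBag out = getValue();
    cleanup();
    return out;
  }
}
{code}

The worst case (one bag's chunks running far ahead of the other) keeps 
everything, which is still no worse than materializing the whole bags up front.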

> Make SetIntersect and SetDifference implement Accumulator
> -
>
> Key: DATAFU-116
> URL: https://issues.apache.org/jira/browse/DATAFU-116
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is 
> always smaller than the inputs. Therefore an accumulator implementation 
> should be possible; it would improve memory usage (somewhat) and allow Pig 
> to better optimize loops that use these operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-08 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-116:
---

 Summary: Make SetIntersect and SetDifference implement Accumulator
 Key: DATAFU-116
 URL: https://issues.apache.org/jira/browse/DATAFU-116
 Project: DataFu
  Issue Type: Improvement
Affects Versions: 1.3.0
Reporter: Eyal Allweil


SetIntersect and SetDifference accept only sorted bags, and the output is 
always smaller than the inputs. Therefore an accumulator implementation should 
be possible; it would improve memory usage (somewhat) and allow Pig to better 
optimize loops that use these operations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-03 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-115:

Flags: Patch

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used, with a smaller memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-03 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-115:

Attachment: DATAFU-115.patch

Relatively straightforward patch ... there's one difference from the previous 
behavior: if an exception is thrown, I ignore it and continue iterating toward 
the desired index.

I tried uploading it to the review board - see if [this 
link|https://reviews.apache.org/r/44351/] works.
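
For reference, the accumulating logic looks roughly like the sketch below. It 
is a simplification, not the patch itself: the class name is invented, schema 
handling and the optional default-value argument are left out, and the argument 
positions (bag first, index second) are assumptions.

{code:java}
import java.io.IOException;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class TupleFromBagAccumulatorSketch extends EvalFunc<Tuple> implements Accumulator<Tuple> {

  private long seen = 0;       // how many tuples of the bag we have walked past so far
  private Tuple chosen = null; // the tuple at the requested index, once reached

  @Override
  public void accumulate(Tuple input) throws IOException {
    try {
      long wantedIndex = ((Number) input.get(1)).longValue();
      for (Tuple t : (DataBag) input.get(0)) {
        if (seen++ == wantedIndex) {
          chosen = t;
          return; // reached the requested position - later batches are ignored
        }
      }
    } catch (Exception e) {
      // As described above: swallow the exception and keep going, so one bad
      // batch does not abort the walk toward the desired index.
    }
  }

  @Override
  public Tuple getValue() {
    return chosen;
  }

  @Override
  public void cleanup() {
    seen = 0;
    chosen = null;
  }

  @Override
  public Tuple exec(Tuple input) throws IOException {
    // Non-accumulating path: run the same walk over the whole bag at once.
    cleanup();
    accumulate(input);
    Tuple result = getValue();
    cleanup();
    return result;
  }
}
{code}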

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used, with a smaller memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-03 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-115:
---

 Summary: Make TupleFromBag implement Accumulator
 Key: DATAFU-115
 URL: https://issues.apache.org/jira/browse/DATAFU-115
 Project: DataFu
  Issue Type: Improvement
Affects Versions: 1.3.0
Reporter: Eyal Allweil
Priority: Minor
 Fix For: 1.3.1


Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 

TupleFromBag doesn't need to hold the bag in memory, and can iterate through it 
until it reaches the desired tuple. By implementing Accumulator, larger bags 
can be used, with a smaller memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-02-17 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150312#comment-15150312
 ] 

Eyal Allweil commented on DATAFU-114:
-

Thanks!

After I imported the projects individually, like you suggested, it works fine 
in Eclipse ... I suggest adding a sentence about this to the base README file 
to help out future contributors.

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Fix For: 1.3.1
>
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-02-04 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131991#comment-15131991
 ] 

Eyal Allweil commented on DATAFU-114:
-

Anyone?

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-01-25 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114990#comment-15114990
 ] 

Eyal Allweil commented on DATAFU-114:
-

Any comments? Can this patch be pulled?

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-01-14 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-114:
---

 Summary: Make FirstTupleFromBag implement Accumulator
 Key: DATAFU-114
 URL: https://issues.apache.org/jira/browse/DATAFU-114
 Project: DataFu
  Issue Type: Improvement
Affects Versions: 1.3.0
 Environment: All
Reporter: Eyal Allweil
Priority: Minor


FirstTupleFromBag only needs the first tuple from the bag, but because it 
doesn't implement Accumulator the entire bag needs to be passed to it 
in-memory. The fix is very minor and will make the UDF support large bags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-01-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-114:

Attachment: FirstTupleFromBag.java

I wasn't able to test this patch because I can't get the build working on my 
system (Ubuntu LTS) ... I'm getting the error described 
[here|https://issues.apache.org/jira/browse/DATAFU-95]. I can't seem to make 
Gradle use a different Java version to get it to compile.

However, since the implementation of Accumulator is relatively straightforward, 
I hopefully haven't made any mistakes, and I would appreciate it if someone 
whose build is working could try it out and pull the patch.
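
The intent is along these lines - a sketch only, under the same idea as the 
attached FirstTupleFromBag.java, with schema handling left out; the class name 
and the fallback-to-a-default-tuple argument are my shorthand here and may 
differ from the actual attachment.

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.Accumulator;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class FirstTupleFromBagSketch extends EvalFunc<Tuple> implements Accumulator<Tuple> {

  private Tuple first = null;
  private Tuple defaultTuple = null;

  @Override
  public void accumulate(Tuple input) throws IOException {
    if (input.size() > 1 && input.get(1) instanceof Tuple) {
      defaultTuple = (Tuple) input.get(1); // remember the fallback for getValue()
    }
    if (first == null) {
      Iterator<Tuple> it = ((DataBag) input.get(0)).iterator();
      if (it.hasNext()) {
        first = it.next(); // only the first tuple ever matters
      }
    }
    // Once 'first' is set, later batches of the bag are ignored entirely.
  }

  @Override
  public Tuple getValue() {
    return first != null ? first : defaultTuple;
  }

  @Override
  public void cleanup() {
    first = null;
    defaultTuple = null;
  }

  @Override
  public Tuple exec(Tuple input) throws IOException {
    // Non-accumulating path: same logic over the whole bag at once.
    cleanup();
    accumulate(input);
    Tuple result = getValue();
    cleanup();
    return result;
  }
}
{code}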

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-95) Improve wrong JDK error message

2016-01-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097858#comment-15097858
 ] 

Eyal Allweil commented on DATAFU-95:


As an immediate, easy improvement, stating which Java version is required 
in the main README on GitHub would be great.

> Improve wrong JDK error message
> ---
>
> Key: DATAFU-95
> URL: https://issues.apache.org/jira/browse/DATAFU-95
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Jakob Homan
>Priority: Minor
>
> Right now if one tries to build against JDK1.7, the resulting failure is a 
> bit unclear:
> {noformat}Download 
> https://repo1.maven.org/maven2/org/eclipse/equinox/app/1.3.200-v20130910-1609/app-1.3.200-v20130910-1609.jar
> /Users/jahoman/repos/datafu/build-plugin/src/main/java/org/adrianwalker/multilinestring/MultilineProcessor.java:18:
>  error: cannot find symbol
> @SupportedSourceVersion(SourceVersion.RELEASE_8)
>  ^
>   symbol:   variable RELEASE_8
>   location: class SourceVersion
> 1 error
> :build-plugin:compileJava FAILED
> FAILURE: Build failed with an exception.
> {noformat}
> It may be better to use something like [The 
> Sweeney|https://github.com/boxheed/gradle-sweeney-plugin] to enforce this and 
> provide a better, faster message on failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)