[jira] [Commented] (DATAFU-137) KEYS sigs and hashes must be linked from the download page
[ https://issues.apache.org/jira/browse/DATAFU-137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365386#comment-16365386 ] Eyal Allweil commented on DATAFU-137: - [~sheric], can you take a look at this? > KEYS sigs and hashes must be linked from the download page > -- > > Key: DATAFU-137 > URL: https://issues.apache.org/jira/browse/DATAFU-137 > Project: DataFu > Issue Type: Bug >Reporter: Sebb >Priority: Major > > The download page refers to verifying downloads, but leaves it up to the user > to find the sigs and hashes. > The sigs and hashes should be listed beside the files to which they apply. > See for example: > https://httpd.apache.org/download.cgi#apache24 > which uses a simple table. > URLs should use https, e.g. > https://www.apache.org/dist/incubator/datafu/apache-datafu-incubating-1.3.3/apache-datafu-incubating-sources-1.3.3.tgz.asc > The KEYS file URL should use https, and must use the ASF main host, i.e. > https://www.apache.org/dist/incubator/datafu/KEYS -- This message was sent by Atlassian JIRA (v7.6.3#76005)
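The verification the issue asks the download page to document boils down to comparing a file's digest against the published .sha512 value (signature checking additionally needs gpg and the KEYS file). A minimal Python sketch of the hash comparison; the helper names and inputs are illustrative, not part of any Apache tooling:

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    """SHA-512 digest as lowercase hex, like the contents of a .sha512 file."""
    return hashlib.sha512(data).hexdigest()

def matches_published(data: bytes, published: str) -> bool:
    """Compare a download's digest against a published checksum,
    tolerating surrounding whitespace and uppercase hex."""
    return sha512_hex(data) == published.strip().lower()
```

In practice the `data` would be the downloaded release tarball read in binary mode, and `published` the contents of the corresponding `.sha512` file fetched over https.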
[jira] [Created] (DATAFU-132) Make DataFu compile with Java 8
Eyal Allweil created DATAFU-132: --- Summary: Make DataFu compile with Java 8 Key: DATAFU-132 URL: https://issues.apache.org/jira/browse/DATAFU-132 Project: DataFu Issue Type: Improvement Affects Versions: 1.3.3 Reporter: Eyal Allweil Attachments: DATAFU-132.patch Currently DataFu only compiles with Java 7. It would be great if Java 8 could be used. The attached patch makes this possible by making the Java version check more permissive (anything above Java 7). I've run a build and the tests, deployed the resulting jar, and tested a macro and UDF with it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-99) Can't build on Windows
[ https://issues.apache.org/jira/browse/DATAFU-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-99: --- Attachment: DATAFU-99.patch I've worked out an ugly fix that sidesteps the problem described in the bug. Basically, the file-locking-with-hashes-when-copying-within-the-same-folder problem exists in the [copy task|https://docs.gradle.org/3.5.1/userguide/working_with_files.html#sec:copying_files] because it attempts to be incremental. What I did was copy the gradle.properties file to a temporary folder, and then use the [copy method|https://docs.gradle.org/3.5.1/dsl/org.gradle.api.Project.html#org.gradle.api.Project:copy(org.gradle.api.Action)] to copy the changed file back into the original directory. Like I wrote at the beginning - it's ugly, but it fixes the Windows build, and it doesn't take long to copy a single small text file one extra time. > Can't build on Windows > -- > > Key: DATAFU-99 > URL: https://issues.apache.org/jira/browse/DATAFU-99 > Project: DataFu > Issue Type: Bug >Reporter: Matthew Hayes > Attachments: DATAFU-99.patch > > > [~ihadanny] reported that there is an issue building on Windows due to some > Gradle bug: > https://discuss.gradle.org/t/error-with-a-copy-task-on-windows/1803/3 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-99) Can't build on Windows
[ https://issues.apache.org/jira/browse/DATAFU-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16315193#comment-16315193 ] Eyal Allweil commented on DATAFU-99: Until this is solved completely, please note that building the projects separately *DOES work*. Instead of _./gradlew clean assemble_ You can use _./gradlew clean :datafu-pig:assemble_ or _./gradlew clean :datafu-hourglass:assemble_ > Can't build on Windows > -- > > Key: DATAFU-99 > URL: https://issues.apache.org/jira/browse/DATAFU-99 > Project: DataFu > Issue Type: Bug >Reporter: Matthew Hayes > > [~ihadanny] reported that there is an issue building on Windows due to some > Gradle bug: > https://discuss.gradle.org/t/error-with-a-copy-task-on-windows/1803/3 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil closed DATAFU-116. --- Resolution: Won't Fix Since it seems like Pig doesn't use the Accumulator interface when there are multiple bags in the input, this improvement isn't relevant for these UDFs. > Make SetIntersect and SetDifference implement Accumulator > - > > Key: DATAFU-116 > URL: https://issues.apache.org/jira/browse/DATAFU-116 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil > > SetIntersect and SetDifference accept only sorted bags, and the output is > always smaller than the inputs. Therefore an accumulator implementation > should be possible and it will improve memory usage (somewhat) and allow Pig > to optimize loops with these operations better. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
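For context on why the issue description says an accumulator-style implementation should be possible: sorted inputs can be intersected in a single forward pass, without buffering either side. A minimal sketch of that idea (illustrative only, not DataFu's actual SetIntersect code):

```python
def sorted_intersect(a, b):
    """Two-pointer intersection of two sorted lists. Only the current
    element of each input is examined at a time, which is why a
    streaming (accumulator-style) implementation is possible in
    principle for sorted bags."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```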
[jira] [Commented] (DATAFU-60) Support NDCG calculation within a UDF
[ https://issues.apache.org/jira/browse/DATAFU-60?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284984#comment-16284984 ] Eyal Allweil commented on DATAFU-60: Hi [~jhartman], I know it's been years, but do you think you'll get back to this? If not, someone else can finish it off, since there isn't much left to do. > Support NDCG calculation within a UDF > - > > Key: DATAFU-60 > URL: https://issues.apache.org/jira/browse/DATAFU-60 > Project: DataFu > Issue Type: New Feature >Reporter: Joshua Hartman > Labels: features > Attachments: DATAFU-60-v2.patch, DATAFU-60.patch > > Original Estimate: 48h > Remaining Estimate: 48h > > NDCG is a common evaluation metric for the quality of a list of items > presented to a user. It is an especially common metric in the search > literature. > This feature request is to implement a UDF to calculate NDCG. > NDCG can be calculated using any function to represent the value of a > position. Several useful functions should be available as part of the DataFu > library. First is the standard 1/logarithmic discounting factor. Another > option should be the ability to supply a custom positional value for any > range of positions in the case that a positional "value" is already well > understood. However, the actual discounting function used should be easily > pluggable in the event something custom is needed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
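The pluggable discounting the request describes can be expressed by passing the positional-value function as a parameter. A rough Python sketch of the metric with the standard 1/log2 discount as the default (function names are illustrative, not the proposed UDF API):

```python
import math

def ndcg(relevances, discount=lambda pos: 1.0 / math.log2(pos + 1)):
    """NDCG of a ranked list of relevance scores; positions start at 1.
    The discount function is pluggable, as the feature request asks."""
    def dcg(rels):
        return sum(rel * discount(i + 1) for i, rel in enumerate(rels))
    # Ideal DCG: the same scores in the best possible order.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A custom positional value for a known range of positions, as mentioned in the description, would just be a different `discount` callable.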
[jira] [Updated] (DATAFU-47) UDF for Murmur3 (and other) Hash functions
[ https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-47: --- Attachment: DATAFU-47-new.patch I looked at the review board for this issue, fixed the merge conflicts in HashTests, and addressed the comments that were left. It depends on DATAFU-50, which was reopened, but I put a new patch there so that we can proceed with both. Since I didn't create the review, I can't upload a new diff there, but I've attached it to the Jira issue, and commented in the review board where appropriate. Tests pass, and I've run the content of "hasherTest" on a cluster using the assembled DataFu jar to make sure that the autojarring of the new Guava version works properly. I'll respond to the review board comments later. > UDF for Murmur3 (and other) Hash functions > -- > > Key: DATAFU-47 > URL: https://issues.apache.org/jira/browse/DATAFU-47 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer > Labels: Guava, Hash, UDF > Attachments: > 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch, > 0001-UDF-for-Murmur3-and-other-Hash-functions.patch, DATAFU-47-new.patch > > > DataFu should offer the murmur3 hash. > The attached patch uses Guava to add murmur3 (a fast hash with good > statistical properties), SipHash-2-4 (a fast cryptographically secure hash), > crc32, adler32, md5 and sha. > From the javadoc: > * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a > [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. > Murmur3 is fast, with exceptionally good statistical properties; it's a > good choice if all you need is good mixing of the inputs. It is _not_ > cryptographically secure; that is, given an output value from murmur3, there > are efficient algorithms to find an input yielding the same output value. 
> Supply the seed as a string that > [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)] > can handle. > * 'sip24', [optional seed]: Returns a [64-bit > SipHash-2-4|https://131002.net/siphash/]. SipHash is competitive in > performance with Murmur3, and is simpler and faster than the cryptographic > algorithms below. When used with a seed, it can be considered > cryptographically secure: given the output from a sip24 instance but not the > seed used, we cannot efficiently craft a message yielding the same output > from that instance. > * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to > Java's Adler32 Checksum. > * 'crc32': Returns a CRC-32 checksum (32 hash bits) by delegating to Java's > CRC32 Checksum. > * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 > MessageDigest. > * 'sha1': Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 > MessageDigest. > * 'sha256': Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 > MessageDigest. > * 'sha512': Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 > MessageDigest. > * 'good-(integer number of bits)': Returns a general-purpose, > non-cryptographic-strength, streaming hash function that produces hash codes > of length at least minimumBits. Users without specific compatibility > requirements and who do not persist the hash codes are encouraged to choose > this hash function. (Cryptographers, like dieticians and fashionistas, > occasionally realize that We've Been Doing it Wrong This Whole Time. Using > 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To > (Wear|Eat|Hash With) this Fall.) Values for this hash will change from run to > run. 
> Examples: > {code} > define DefaultH datafu.pig.hash.Hasher(); > define GoodH datafu.pig.hash.Hasher('good-32'); > define BetterH datafu.pig.hash.Hasher('good-127'); > define MurmurH32 datafu.pig.hash.Hasher('murmur3-32'); > define MurmurH32A datafu.pig.hash.Hasher('murmur3-32', '0x0'); > define MurmurH32B datafu.pig.hash.Hasher('murmur3-32', '0x56789abc'); > define MurmurH128 datafu.pig.hash.Hasher('murmur3-128'); > define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0'); > define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678'); > define MD5H datafu.pig.hash.Hasher('md5'); > define SHA1H datafu.pig.hash.Hasher('sha1'); > define SHA256H datafu.pig.hash.Hasher('sha256'); > define SHA512H datafu.pig.hash.Hasher('sha512'); > > data_in = LOAD 'input' as (val:chararray); > > data_out = FOREACH data_in GENERATE > DefaultH(val), GoodH(val), BetterH(val), > MurmurH32(val), MurmurH32A(val), MurmurH32B(val), > MurmurH128(val), MurmurH128A(val), MurmurH128B(val), > SHA1H(val), SHA256H(val),
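For comparison, most of the checksum and digest variants quoted from the javadoc have standard-library analogues in other languages; murmur3, sip24, and 'good-*' do not. A Python sketch confirming the bit widths stated above:

```python
import hashlib
import zlib

data = b"hello"
# crc32 and adler32 are 32-bit checksums, here via zlib.
print(f"crc32:   {zlib.crc32(data):08x}")
print(f"adler32: {zlib.adler32(data):08x}")
# Digest widths match the javadoc: md5=128, sha1=160, sha256=256, sha512=512.
for algo in ("md5", "sha1", "sha256", "sha512"):
    h = hashlib.new(algo, data)
    print(algo, h.digest_size * 8, h.hexdigest())
```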
[jira] [Commented] (DATAFU-30) Website crawl errors for class use links
[ https://issues.apache.org/jira/browse/DATAFU-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273148#comment-16273148 ] Eyal Allweil commented on DATAFU-30: I think newer Javadoc versions don't have a "Use" button, so this error doesn't happen anymore ... I looked at 1.3.0 and 1.3.1. If that's true, I suggest closing this issue, since fixing an older version's javadoc seems unimportant to me. > Website crawl errors for class use links > > > Key: DATAFU-30 > URL: https://issues.apache.org/jira/browse/DATAFU-30 > Project: DataFu > Issue Type: Bug >Reporter: Matthew Hayes > Attachments: > datafu-incubator-apache-org_20140131T042403Z_CrawlErrors.csv > > > Google webmaster tools has reported crawl errors. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (DATAFU-118) Automatically run rat task when running assemble
[ https://issues.apache.org/jira/browse/DATAFU-118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil reassigned DATAFU-118: --- Assignee: Eyal Allweil > Automatically run rat task when running assemble > > > Key: DATAFU-118 > URL: https://issues.apache.org/jira/browse/DATAFU-118 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Eyal Allweil >Priority: Minor > > The rat task checks that our files have the right headers. We don't > automatically run it for assemble so it isn't easy for new contributors to > catch issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (DATAFU-125) Upgrade Gradle to v4 or later
[ https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-125. - Resolution: Fixed Merged - everything looks fine to me. I repeated these tests, and tried out the jar with a simple Pig script. I also added the three deleted files to our _.gitignore_, as [~cur4so] suggested (in a different Jira issue). > Upgrade Gradle to v4 or later > - > > Key: DATAFU-125 > URL: https://issues.apache.org/jira/browse/DATAFU-125 > Project: DataFu > Issue Type: Task >Reporter: Matthew Hayes >Assignee: Eyal Allweil > Attachments: DATAFU-125-v2.patch, DATAFU-125.patch > > > We should update to the most recent version of Gradle. We're currently using > 2.4. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number
[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258480#comment-16258480 ] Eyal Allweil commented on DATAFU-63: I wonder if the gradlew script is there because it can be "just used" on some systems. On one computer I couldn't find a gradle installation at all - maybe I skipped the bootstrapping step and the script worked for me. I'm no gradle expert so I don't know if this is feasible. If not, it sounds like a good idea to open a new issue for removing the gradlew, following your suggestion. I looked at the code you linked, and I'll try again to answer your questions (in this issue and in the comments in the code). The way the Algebraic interface works in Pig is that Pig calls EvalFunc.getInitial to get the name of the (inner) class with the Initial implementation. You can see this code in POUserFunc in the Apache Pig project; it's not part of DataFu. The Initial class is then instantiated, and its exec is called. The output is gathered, and then the same process happens for Intermediate: the output from Initial becomes the input for Intermediate. Finally, all these outputs from various mappers are sent to a reducer, and the getFinal method is used to instantiate the Final class. This means the implementations you've suggested would only work if this UDF implemented the EvalFunc or AccumulatorEvalFunc interfaces. But in that case we already have a UDF that gives us a comparable implementation in DataFu - [ReservoirSample|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/sampling/ReservoirSample.java], so I don't think it's what is desired in this particular Jira issue. This brings us back to what Matthew wrote way back in 2014 - implementing algorithm 6 from [this paper|http://jmlr.org/proceedings/papers/v28/meng13a.pdf]. 
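The Initial/Intermediate/Final flow described above can be illustrated outside Pig with a toy COUNT-style function; here plain Python functions stand in for the exec methods of the classes named by getInitial, getIntermed, and getFinal (a simulation only, not Pig's actual machinery):

```python
# Toy simulation of Pig's Algebraic contract for a COUNT-style UDF.
def initial(tup):
    """Runs per input tuple on the mappers."""
    return 1

def intermediate(partials):
    """Runs in combiners over the outputs of initial (or of earlier
    intermediate passes)."""
    return sum(partials)

def final(partials):
    """Runs once in the reducer over all combiner outputs."""
    return sum(partials)

mapper_inputs = [["a", "b"], ["c", "d", "e"]]  # two mappers' data
combined = [intermediate([initial(t) for t in part]) for part in mapper_inputs]
result = final(combined)  # total count across both mappers: 5
```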
I think what that algorithm provides is a way to generate a _k_-sized sample without ever holding all k items in memory in the mapper (which is the limitation for ReservoirSample, it won't work if _k_ is too big). Does that make sense? > SimpleRandomSample by a fixed number > > > Key: DATAFU-63 > URL: https://issues.apache.org/jira/browse/DATAFU-63 > Project: DataFu > Issue Type: New Feature >Reporter: jian wang >Assignee: jian wang > > SimpleRandomSample currently supports random sampling by probability, it does > not support random sampling of a fixed number of items. ReservoirSample may do the > work but since it relies on an in-memory priority queue, memory issues may > happen if we are going to sample a huge number of items, eg: sample 100M from > 100G data. > Suggested approach is to create a new class "SimpleRandomSampleByCount" that > uses Manuver's rejection threshold to reject items whose weight exceeds the > threshold as we go from mapper to combiner to reducer. The majority part of > the algorithm will be very similar to SimpleRandomSample, except that we do > not use Bernstein's theory to accept items and replace probability p = k / n, > k is the number of items to sample, n is the total number of items local in > mapper, combiner and reducer. > Quote this requirement from others: > "Hi folks, > Question: does anybody know if there is a quicker way to randomly sample a > specified number of rows from grouped data? I’m currently doing this, since > it appears that the SAMPLE operator doesn’t work inside FOREACH statements: > photosGrouped = GROUP photos BY farm; > agg = FOREACH photosGrouped { > rnds = FOREACH photos GENERATE *, RANDOM() as rnd; > ordered_rnds = ORDER rnds BY rnd; > limitSet = LIMIT ordered_rnds 5000; > GENERATE group AS farm, >FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, > secret); > }; > This approach seems clumsy, and appears to run quite slowly (I’m assuming the > ORDER/LIMIT isn’t great for performance). 
Is there a less awkward way to do > this? > Thanks, > " -- This message was sent by Atlassian JIRA (v6.4.14#64029)
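The in-memory limitation discussed in the comment above is inherent to classic reservoir sampling, which keeps all _k_ sampled items in the reservoir at once. A minimal sketch of Algorithm R for reference (not DataFu's ReservoirSample implementation):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform k-sample of a stream in one pass (Algorithm R).
    The whole reservoir (k items) must fit in memory, which is the
    limitation noted in the comment when k is very large."""
    rng = rng or random.Random(0)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Replace an existing element with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```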
[jira] [Commented] (DATAFU-130) Add left outer join macro described in the DataFu guide
[ https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253564#comment-16253564 ] Eyal Allweil commented on DATAFU-130: - Hi [~varunu28], Thank you for your interest! Do you have any experience with Pig? I can't seem to assign this issue to you, but that isn't really necessary to work on it. Go right ahead! For setting up your environment, you should follow [this guide|http://datafu.incubator.apache.org/community/contributing.htm]. We haven't published instructions for how to contribute Pig macros. I'll try to write a rough draft of a guide and email it here or put it up on our wiki. In the meantime, you can look at [count_macros.pig|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/resources/datafu/count_macros.pig] for an example of a macro file (though all you need to do is copy the macro in the Jira to the macros directory), and [MacroTests.java|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/test/java/datafu/test/pig/macros/MacroTests.java] for an example of a test. You can add your test to this file, actually. For a guide to how to prepare a patch file, you can look [here|https://cwiki.apache.org/confluence/display/DATAFU/Contributing+to+Apache+DataFu]. > Add left outer join macro described in the DataFu guide > --- > > Key: DATAFU-130 > URL: https://issues.apache.org/jira/browse/DATAFU-130 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Labels: macro, newbie > > In our > [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a > macro is described for making a three-way left outer join conveniently. We > can add this macro to DataFu to make it even easier to use. 
> The macro's code is as follows: > {noformat} > DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) > returns joined { > cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY > $key3; > $joined = FOREACH cogrouped GENERATE > FLATTEN($relation1), > FLATTEN(EmptyBagToNullFields($relation2)), > FLATTEN(EmptyBagToNullFields($relation3)); > } > {noformat} > (we would obviously want to add a test for this, too) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
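The macro's behavior - every row of the first relation survives, and empty bags from the other two become nulls (via EmptyBagToNullFields) - can be mirrored in plain Python for clarity. The relation and key names below are illustrative, not part of the macro:

```python
from collections import defaultdict

def left_outer_join3(rel1, key1, rel2, key2, rel3, key3):
    """Three-way left outer join over lists of dicts: every row of rel1
    appears in the output; missing matches in rel2/rel3 show up as None,
    like EmptyBagToNullFields turns an empty bag into null fields."""
    idx2 = defaultdict(list)
    for row in rel2:
        idx2[row[key2]].append(row)
    idx3 = defaultdict(list)
    for row in rel3:
        idx3[row[key3]].append(row)
    out = []
    for row in rel1:
        k = row[key1]
        for r2 in idx2.get(k, [None]):
            for r3 in idx3.get(k, [None]):
                out.append((row, r2, r3))
    return out
```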
[jira] [Commented] (DATAFU-63) SimpleRandomSample by a fixed number
[ https://issues.apache.org/jira/browse/DATAFU-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16249793#comment-16249793 ] Eyal Allweil commented on DATAFU-63: Hi [~cur4so], I'll quickly answer your last comment - I'll get to the previous one as soon as I can. We do indeed still use Gradle 2.4 in the master branch. We're [about to update to Gradle 3.5|https://issues.apache.org/jira/browse/DATAFU-125], but it hasn't been merged yet. However, when I did the gradle bootstrapping, it didn't modify my _gradlew_ file - what OS are you on? (BTW - we can't add it to the gitignore because it's checked into the repository, and you can't ignore files that are checked in) > SimpleRandomSample by a fixed number > > > Key: DATAFU-63 > URL: https://issues.apache.org/jira/browse/DATAFU-63 > Project: DataFu > Issue Type: New Feature >Reporter: jian wang >Assignee: jian wang > > SimpleRandomSample currently supports random sampling by probability, it does > not support random sampling of a fixed number of items. ReservoirSample may do the > work but since it relies on an in-memory priority queue, memory issues may > happen if we are going to sample a huge number of items, eg: sample 100M from > 100G data. > Suggested approach is to create a new class "SimpleRandomSampleByCount" that > uses Manuver's rejection threshold to reject items whose weight exceeds the > threshold as we go from mapper to combiner to reducer. The majority part of > the algorithm will be very similar to SimpleRandomSample, except that we do > not use Bernstein's theory to accept items and replace probability p = k / n, > k is the number of items to sample, n is the total number of items local in > mapper, combiner and reducer. > Quote this requirement from others: > "Hi folks, > Question: does anybody know if there is a quicker way to randomly sample a > specified number of rows from grouped data? 
I’m currently doing this, since > it appears that the SAMPLE operator doesn’t work inside FOREACH statements: > photosGrouped = GROUP photos BY farm; > agg = FOREACH photosGrouped { > rnds = FOREACH photos GENERATE *, RANDOM() as rnd; > ordered_rnds = ORDER rnds BY rnd; > limitSet = LIMIT ordered_rnds 5000; > GENERATE group AS farm, >FLATTEN(limitSet.(photo_id, server, secret)) AS (photo_id, server, > secret); > }; > This approach seems clumsy, and appears to run quite slowly (I’m assuming the > ORDER/LIMIT isn’t great for performance). Is there a less awkward way to do > this? > Thanks, > " -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (DATAFU-48) Upgrade Guava to 20.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-48. Resolution: Fixed Assignee: Eyal Allweil (was: Philip (flip) Kromer) Fix Version/s: 1.3.3 Merged > Upgrade Guava to 20.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Eyal Allweil >Priority: Minor > Labels: build, dependency, guava, version > Fix For: 1.3.3 > > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-48) Upgrade Guava to 20.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-48: --- Summary: Upgrade Guava to 20.0 (was: Upgrade Guava to 17.0) > Upgrade Guava to 20.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later
[ https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16220371#comment-16220371 ] Eyal Allweil commented on DATAFU-125: - When I build with _./gradlew clean release -Prelease=true_, I don't get a zip file. In fact, I don't get jars either - I need to use _assemble_ to make them (both on master with and without upgrading Gradle). Am I using the wrong command? Does it work for you, [~matterhayes]? > Upgrade Gradle to v4 or later > - > > Key: DATAFU-125 > URL: https://issues.apache.org/jira/browse/DATAFU-125 > Project: DataFu > Issue Type: Task >Reporter: Matthew Hayes >Assignee: Eyal Allweil > Attachments: DATAFU-125.patch > > > We should update to the most recent version of Gradle. We're currently using > 2.4. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (DATAFU-131) Update DataFu site to meet graduation requirements
[ https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil reassigned DATAFU-131: --- Assignee: Matthew Hayes > Update DataFu site to meet graduation requirements > -- > > Key: DATAFU-131 > URL: https://issues.apache.org/jira/browse/DATAFU-131 > Project: DataFu > Issue Type: Bug >Reporter: Eyal Allweil >Assignee: Matthew Hayes > Attachments: DATAFU-131.patch, Screen Shot 2017-10-25 at 7.21.09 > PM.png > > > The following issues were raised with the [DataFu web > site|http://datafu.incubator.apache.org] as part of the [graduation > discussion on the incubator general mailing > list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E] > There's no link to the main ASF website. > There's no LICENSE or Thanks link. > There's no download link. > etc. > The quick start guide pages do have download links, but the primary > link is to Maven rather than the ASF, and there are no instructions as > to how to check sigs or hashes, and no link to the KEYS file that I > could find. > The SHA-512 checksum must have the extension .sha512 > http://www.apache.org/dev/release-distribution.html#sigs-and-sums > Also the latest release appears to be 1.3.2 (dated Feb 2017) but the > download links point to 1.3.1. > The older releases (1.3.1 and 1.3.0) should have been deleted from the > release/dist directory by now. > There's no Apache feather logo which is often used as the link to the > main ASF site. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-17) Improve testing of randomized functions
[ https://issues.apache.org/jira/browse/DATAFU-17?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214398#comment-16214398 ] Eyal Allweil commented on DATAFU-17: I think we can close this, just as we closed [DATAFU-28|https://issues.apache.org/jira/browse/DATAFU-28]. If all the tests take less than twenty minutes now I don't think it's worth making an effort to minimize the randomized functions. > Improve testing of randomized functions > --- > > Key: DATAFU-17 > URL: https://issues.apache.org/jira/browse/DATAFU-17 > Project: DataFu > Issue Type: Improvement >Reporter: Will Vaughan > > We have a large number of UDFs with a random component that are difficult and > often slow to test. We should improve our testing standards and capabilities > for this class of functions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214212#comment-16214212 ] Eyal Allweil commented on DATAFU-48: As an additional check, I ran a Pig script which uses _SimpleRandomSampleWithReplacementVote_ (which uses Guava) to see that it still runs correctly. > Upgrade Guava to 17.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later
[ https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214208#comment-16214208 ] Eyal Allweil commented on DATAFU-125: - _check_ and _clean release_ run and return SUCCESS. Are there any special files I should check that are the result of the _release_ task? I also ran a script on the packaged jar (the regular one, not core or the jarjar) and it ran fine. > Upgrade Gradle to v4 or later > - > > Key: DATAFU-125 > URL: https://issues.apache.org/jira/browse/DATAFU-125 > Project: DataFu > Issue Type: Task >Reporter: Matthew Hayes > Attachments: DATAFU-125.patch > > > We should update to the most recent version of Gradle. We're currently using > 2.4. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-118) Automatically run rat task when running assemble
[ https://issues.apache.org/jira/browse/DATAFU-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16211505#comment-16211505 ] Eyal Allweil commented on DATAFU-118: - (because we have a patch that seems to work on a newer Gradle version linked in the review board) > Automatically run rat task when running assemble > > > Key: DATAFU-118 > URL: https://issues.apache.org/jira/browse/DATAFU-118 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > > The rat task checks that our files have the right headers. We don't > automatically run it for assemble so it isn't easy for new contributors to > catch issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-32) Hourglass concrete jobs should have getters and setters for output name and namespace
[ https://issues.apache.org/jira/browse/DATAFU-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210941#comment-16210941 ] Eyal Allweil commented on DATAFU-32: Is this still relevant? If so, I'll open a [Help Wanted task|https://helpwanted.apache.org/] for it. > Hourglass concrete jobs should have getters and setters for output name and > namespace > - > > Key: DATAFU-32 > URL: https://issues.apache.org/jira/browse/DATAFU-32 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Matthew Hayes > > With the abstract versions you can override getOutputSchemaName() and > getOutputSchemaNamespace(). But the concrete versions don't expose setters, > so you have to extend the class to override the defaults. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210769#comment-16210769 ] Eyal Allweil commented on DATAFU-48: None, actually. Hadoop 1 and 2 are using 11.0.2, like us. Hadoop 3 is [using 21|https://issues.apache.org/jira/browse/HADOOP-10101]. > Upgrade Guava to 17.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-131) Update DataFu site to meet graduation requirements
[ https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16199209#comment-16199209 ] Eyal Allweil commented on DATAFU-131: - Here's a link to the Apache site guidelines: https://www.apache.org/foundation/marks/pmcs#navigation > Update DataFu site to meet graduation requirements > -- > > Key: DATAFU-131 > URL: https://issues.apache.org/jira/browse/DATAFU-131 > Project: DataFu > Issue Type: Bug >Reporter: Eyal Allweil > > The following issues were raised with the [DataFu web > site|http://datafu.incubator.apache.org] as part of the [graduation > discussion on the incubator general mailing > list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E] > There's no link to the main ASF website. > There's no LICENSE or Thanks link. > There's no download link. > etc. > The quick start guide pages do have download links, but the primary > link is to Maven rather than the ASF, and there are no instructions as > to how to check sigs or hashes, and no link to the KEYS file that I > could find. > The SHA-512 checksum must have the extension .sha512 > http://www.apache.org/dev/release-distribution.html#sigs-and-sums > Also the latest release appears to be 1.3.2 (dated Feb 2017) but the > download links point to 1.3.1. > The older releases (1.3.1 and 1.3.0) should have been deleted from the > release/dist directory by now. > There's no Apache feather logo which is often used as the link to the > main ASF site. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-131) Update DataFu site to meet graduation requirements
Eyal Allweil created DATAFU-131: --- Summary: Update DataFu site to meet graduation requirements Key: DATAFU-131 URL: https://issues.apache.org/jira/browse/DATAFU-131 Project: DataFu Issue Type: Bug Reporter: Eyal Allweil The following issues were raised with the [DataFu web site|http://datafu.incubator.apache.org] as part of the [graduation discussion on the incubator general mailing list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E] There's no link to the main ASF website. There's no LICENSE or Thanks link. There's no download link. etc. The quick start guide pages do have download links, but the primary link is to Maven rather than the ASF, and there are no instructions as to how to check sigs or hashes, and no link to the KEYS file that I could find. The SHA-512 checksum must have the extension .sha512 http://www.apache.org/dev/release-distribution.html#sigs-and-sums Also the latest release appears to be 1.3.2 (dated Feb 2017) but the download links point to 1.3.1. The older releases (1.3.1 and 1.3.0) should have been deleted from the release/dist directory by now. There's no Apache feather logo which is often used as the link to the main ASF site. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-87) Edit distance
[ https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197120#comment-16197120 ] Eyal Allweil commented on DATAFU-87: On second thought, since this UDF is now available in Hive, and since Levenshtein distance is a purely local computation, I'm guessing there's no need for a specific DataFu implementation. Shall we close this issue? Here are some links to the Hive UDF. https://issues.apache.org/jira/browse/HIVE-9556 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions > Edit distance > - > > Key: DATAFU-87 > URL: https://issues.apache.org/jira/browse/DATAFU-87 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Joydeep Banerjee > Attachments: DATAFU-87.patch > > > [This is work-in-progress] > Given 2 strings, provide a measure of dis-similarity (Levenshtein distance) > between them. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-48) Upgrade Guava to 17.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-48: --- Attachment: DATAFU-48-update-gradle-to-20.0.patch I checked, and Guava 20.0 is the last version that we can update to without getting into a Java version conflict. So this is a patch that updates Guava to 20.0. The tests all pass (build plugin, hourglass, and pig) and I ran a simple Pig script that uses the generated DataFu pig jar to see that it's still valid. Let's close this ancient ticket! > Upgrade Guava to 17.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL
[ https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196058#comment-16196058 ] Eyal Allweil commented on DATAFU-12: [~matterhayes], anyone, what do you think? I wouldn't "waste" our time on something that can already be done in Pig via Hive, and I'd like to close JIRA issues that are no longer relevant. > Implement Lead UDF based on version from SQL > > > Key: DATAFU-12 > URL: https://issues.apache.org/jira/browse/DATAFU-12 > Project: DataFu > Issue Type: New Feature >Reporter: Matthew Hayes > > Min Zhou has provided this suggestion ([Issue #88 on > GitHub|https://github.com/linkedin/datafu/pull/88]): > Lead is an analytic function like Oracle's Lead function. It provides access > to more than one tuple of a bag at the same time without a self join. Given a > bag of tuples returned from a query, LEAD provides access to a tuple at a > given physical offset beyond that position. Generates pairs of all items in a > bag. > If you do not specify offset, then its default is 1. Null is returned if the > offset goes beyond the scope of the bag.
> Example 1: > {noformat} >register ba-pig-0.1.jar >define Lead datafu.pig.bags.Lead('2'); >-- INPUT: ({(1),(2),(3),(4)}) >data = LOAD 'input' AS (data: bag {T: tuple(v:INT)}); >describe data; >-- OUTPUT: ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)}) >-- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: > int),elem2: (v: int))}} >data2 = FOREACH data GENERATE Lead(data); >describe data2; >DUMP data2; > {noformat} > Example 2: > {noformat} >register ba-pig-0.1.jar >define Lead datafu.pig.bags.Lead(); >-- INPUT: > ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})}) >data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: > tuple(v2:INT)})}); >--describe data; >-- OUTPUT: > ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)}) >data2 = FOREACH data GENERATE Lead(data); >--describe data2; >DUMP data2; > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-130) Add left outer join macro described in the DataFu guide
[ https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-130: Description: In our [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a macro is described for making a three-way left outer join conveniently. We can add this macro to DataFu to make it even easier to use. The macro's code is as follows: {noformat} DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined { cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3; $joined = FOREACH cogrouped GENERATE FLATTEN($relation1), FLATTEN(EmptyBagToNullFields($relation2)), FLATTEN(EmptyBagToNullFields($relation3)); } {noformat} (we would obviously want to add a test for this, too) was: In our [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a macro is described for making a three-way left outer join conveniently. We can add this macro to DataFu to make it even easier to use. The macro's code is as follows: {noformat} DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined { cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3; $joined = FOREACH cogrouped GENERATE FLATTEN($relation1), FLATTEN(EmptyBagToNullFields($relation2)), FLATTEN(EmptyBagToNullFields($relation3)); } (we would obviously want to add a test for this, too) {noformat} > Add left outer join macro described in the DataFu guide > --- > > Key: DATAFU-130 > URL: https://issues.apache.org/jira/browse/DATAFU-130 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Labels: macro, newbie > > In our > [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a > macro is described for making a three-way left outer join conveniently. We > can add this macro to DataFu to make it even easier to use. 
> The macro's code is as follows: > {noformat} > DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) > returns joined { > cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY > $key3; > $joined = FOREACH cogrouped GENERATE > FLATTEN($relation1), > FLATTEN(EmptyBagToNullFields($relation2)), > FLATTEN(EmptyBagToNullFields($relation3)); > } > {noformat} > (we would obviously want to add a test for this, too) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-130) Add left outer join macro described in the DataFu guide
[ https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169225#comment-16169225 ] Eyal Allweil commented on DATAFU-130: - I think this is a good Jira issue to put in the [Apache Help Wanted site|https://helpwanted.apache.org/]. If there's no objection, I'll add it there. > Add left outer join macro described in the DataFu guide > --- > > Key: DATAFU-130 > URL: https://issues.apache.org/jira/browse/DATAFU-130 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Labels: macro, newbie > > In our > [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a > macro is described for making a three-way left outer join conveniently. We > can add this macro to DataFu to make it even easier to use. > The macro's code is as follows: > {noformat} > DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) > returns joined { > cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY > $key3; > $joined = FOREACH cogrouped GENERATE > FLATTEN($relation1), > FLATTEN(EmptyBagToNullFields($relation2)), > FLATTEN(EmptyBagToNullFields($relation3)); > } > (we would obviously want to add a test for this, too) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-61. Resolution: Fixed Assignee: Eyal Allweil Merged. > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney >Assignee: Eyal Allweil > Labels: macro > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165991#comment-16165991 ] Eyal Allweil commented on DATAFU-61: Yes, I'll merge it. I did respond to an open issue in the review request that I only just noticed, something about using COUNT vs. SUM when calculating the IDF part ... as far as I can tell, the existing code is OK but it wouldn't hurt if you or Russell want to take a look at it. > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Labels: macro > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165925#comment-16165925 ] Eyal Allweil commented on DATAFU-119: - The documentation can be part of [DATAFU-128|https://issues.apache.org/jira/browse/DATAFU-128]. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Attachments: DATAFU-119-2.patch > > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-119: Attachment: DATAFU-119-2.patch > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Attachments: DATAFU-119-2.patch > > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164373#comment-16164373 ] Eyal Allweil commented on DATAFU-61: One last thing - I noticed after I uploaded my patch that it has my email, but I think it would be better for it to have your email, [~russell.jurney], since all I did was write the test. Is it OK that I replace my email with yours before committing this, so we get a (more accurate) "eyal committed with russell" type commit? > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Labels: macro > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-130) Add left outer join macro described in the DataFu guide
Eyal Allweil created DATAFU-130: --- Summary: Add left outer join macro described in the DataFu guide Key: DATAFU-130 URL: https://issues.apache.org/jira/browse/DATAFU-130 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil In our [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a macro is described for making a three-way left outer join conveniently. We can add this macro to DataFu to make it even easier to use. The macro's code is as follows: {noformat} DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined { cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3; $joined = FOREACH cogrouped GENERATE FLATTEN($relation1), FLATTEN(EmptyBagToNullFields($relation2)), FLATTEN(EmptyBagToNullFields($relation3)); } (we would obviously want to add a test for this, too) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-128) Add documentation for macros
[ https://issues.apache.org/jira/browse/DATAFU-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162936#comment-16162936 ] Eyal Allweil commented on DATAFU-128: - Is the documentation for updating the website accurate? There are references to svn in there, which lead me to think they might not be relevant anymore ... > Add documentation for macros > > > Key: DATAFU-128 > URL: https://issues.apache.org/jira/browse/DATAFU-128 > Project: DataFu > Issue Type: Improvement >Reporter: Eyal Allweil > > Now that it is possible to add Pig macros to DataFu, we should update the > documentation to reflect this, and provide guidelines and point would-be > contributors to examples. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-129) New macro - dedup
[ https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-129: Attachment: DATAFU-129.patch Macro and test > New macro - dedup > - > > Key: DATAFU-129 > URL: https://issues.apache.org/jira/browse/DATAFU-129 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Labels: macro > Attachments: DATAFU-129.patch > > > Macro used to dedup (de-duplicate) a table, based on a key or keys and an > ordering (typically a date updated field). > One thing to consider - the implementation relies on the > ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test > dependencies in order for the test to run. While I feel that anyone using Pig > typically has PiggyBank in the classpath, this might not be true - do we have > an alternative? (maybe adding it to the jarjar?) > The macro's definition looks as follows: > DEFINE dedup(relation, row_key, order_field) returns out { > relation - relation to dedup > row_key - field(s) for group by > order_field - the field for ordering (to find the most recent record) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-129) New macro - dedup
Eyal Allweil created DATAFU-129: --- Summary: New macro - dedup Key: DATAFU-129 URL: https://issues.apache.org/jira/browse/DATAFU-129 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil Assignee: Eyal Allweil Macro used to dedup (de-duplicate) a table, based on a key or keys and an ordering (typically a date updated field). One thing to consider - the implementation relies on the ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test dependencies in order for the test to run. While I feel that anyone using Pig typically has PiggyBank in the classpath, this might not be true - do we have an alternative? (maybe adding it to the jarjar?) The macro's definition looks as follows: DEFINE dedup(relation, row_key, order_field) returns out { relation - relation to dedup row_key - field(s) for group by order_field - the field for ordering (to find the most recent record) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
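[Editor's note] The DATAFU-129 description above gives only the dedup macro's signature. A body along the following lines would match the description (group by the key, keep the row with the greatest order_field via PiggyBank's ExtremalTupleByNthField). This is a hypothetical sketch, not the attached DATAFU-129.patch: the '1'/'max' constructor arguments and the TOTUPLE(*) projection are illustrative assumptions.

{noformat}
-- Hypothetical sketch of the dedup macro; the actual patch may differ.
-- Assumes the PiggyBank jar is registered so ExtremalTupleByNthField resolves.
DEFINE dedup(relation, row_key, order_field) returns out {
    DEFINE ExtremalTupleByNthField
        org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('1', 'max');

    -- Put the ordering field first so position '1' always refers to it,
    -- keeping the whole original row alongside it.
    with_order = FOREACH $relation GENERATE $order_field AS order_field,
                                            TOTUPLE(*) AS original;

    grouped = GROUP with_order BY original.($row_key);

    -- Keep only the tuple with the greatest order_field in each group.
    $out = FOREACH grouped {
        newest = ExtremalTupleByNthField(with_order);
        GENERATE FLATTEN(newest.original);
    };
};
{noformat}

Using a max/min-by-field UDF keeps the dedup to a single grouped pass, instead of ordering each group and taking the first row.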
[jira] [Created] (DATAFU-128) Add documentation for macros
Eyal Allweil created DATAFU-128: --- Summary: Add documentation for macros Key: DATAFU-128 URL: https://issues.apache.org/jira/browse/DATAFU-128 Project: DataFu Issue Type: Improvement Reporter: Eyal Allweil Now that it is possible to add Pig macros to DataFu, we should update the documentation to reflect this, and provide guidelines and point would-be contributors to examples. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-127) New macro - sample by keys
[ https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-127: Attachment: DATAFU-127.patch Patch including new macros and tests > New macro - sample by keys > -- > > Key: DATAFU-127 > URL: https://issues.apache.org/jira/browse/DATAFU-127 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Labels: macro > Attachments: DATAFU-127.patch > > > Two macros that return a sample of a larger table based on a list of keys, > with the schema of the larger table. One of the macros filters by dates, the > other doesn't. > If there are multiple rows with a key that appears in the key list, all of > them will be returned (no deduplication is done). The results are returned > ordered by the key field in a single file. > The implementation uses a replicated join for efficiency, but this means the > key list shouldn't be so large that it doesn't fit in memory. > The first macro's definition looks as follows: > DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) > returns out { > - table_name - table name to sample > - sample_set - a set of keys > - join_key_table - join column name in the table > - join_key_sample - join column name in the sample -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-127) New macro - sample by keys
Eyal Allweil created DATAFU-127: --- Summary: New macro - sample by keys Key: DATAFU-127 URL: https://issues.apache.org/jira/browse/DATAFU-127 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil Assignee: Eyal Allweil Two macros that return a sample of a larger table based on a list of keys, with the schema of the larger table. One of the macros filters by dates, the other doesn't. If there are multiple rows with a key that appears in the key list, all of them will be returned (no deduplication is done). The results are returned ordered by the key field in a single file. The implementation uses a replicated join for efficiency, but this means the key list shouldn't be so large that it doesn't fit in memory. The first macro's definition looks as follows: DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) returns out { - table_name - table name to sample - sample_set - a set of keys - join_key_table - join column name in the table - join_key_sample - join column name in the sample -- This message was sent by Atlassian JIRA (v6.4.14#64029)
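[Editor's note] Based on the DATAFU-127 description above (replicated join against the key list, output ordered by key in a single file), the first macro could be sketched roughly as follows. This is a hypothetical reconstruction, not the attached DATAFU-127.patch: the TOTUPLE(*) projection and the PARALLEL 1 clause are illustrative assumptions.

{noformat}
-- Hypothetical sketch; the attached DATAFU-127 patch may differ.
DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) returns out {
    -- Keep the full original row so the output preserves the table's schema.
    data = FOREACH $table GENERATE $join_key_table AS join_key,
                                   TOTUPLE(*) AS original;

    -- Replicated join: the sample set is listed last and must fit in memory.
    joined = JOIN data BY join_key,
                  $sample_set BY $join_key_sample USING 'replicated';

    -- Order by the key; PARALLEL 1 yields a single output file.
    sorted = ORDER joined BY join_key PARALLEL 1;

    $out = FOREACH sorted GENERATE FLATTEN(original);
};
{noformat}

Rows whose key appears more than once in the table all survive the join, which matches the "no deduplication is done" behavior in the description.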
[jira] [Resolved] (DATAFU-126) There is a typo in document
[ https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-126. - Resolution: Fixed > There is a typo in document > --- > > Key: DATAFU-126 > URL: https://issues.apache.org/jira/browse/DATAFU-126 > Project: DataFu > Issue Type: Improvement >Reporter: Kane >Assignee: Eyal Allweil >Priority: Minor > > It should be "functions" not "functiosn" in the document page > https://datafu.incubator.apache.org/docs/datafu/guide.html. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-126) There is a typo in document
[ https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161252#comment-16161252 ] Eyal Allweil commented on DATAFU-126: - Thanks Kane! I've fixed this in our sources, and it will show up when we release our next version. > There is a typo in document > --- > > Key: DATAFU-126 > URL: https://issues.apache.org/jira/browse/DATAFU-126 > Project: DataFu > Issue Type: Improvement >Reporter: Kane >Assignee: Eyal Allweil >Priority: Minor > > It should be "functions" not "functiosn" in the document page > https://datafu.incubator.apache.org/docs/datafu/guide.html. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (DATAFU-126) There is a typo in document
[ https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil reassigned DATAFU-126: --- Assignee: Eyal Allweil > There is a typo in document > --- > > Key: DATAFU-126 > URL: https://issues.apache.org/jira/browse/DATAFU-126 > Project: DataFu > Issue Type: Improvement >Reporter: Kane >Assignee: Eyal Allweil >Priority: Minor > > It should be "functions" not "functiosn" in the document page > https://datafu.incubator.apache.org/docs/datafu/guide.html. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible
[ https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161211#comment-16161211 ] Eyal Allweil commented on DATAFU-83: By the way, [~ItsAUsernameRight?], if you're already looking at InUDF, and you'd like another contribution afterwards, you can also look at [DATAFU-80|https://issues.apache.org/jira/browse/DATAFU-80] - it's another small change to improve InUDF's behavior. (you can ignore the second part of that issue, which deals with Java versions). > InUDF does not validate that types are compatible > - > > Key: DATAFU-83 > URL: https://issues.apache.org/jira/browse/DATAFU-83 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > Attachments: DATAFU-83.patch, rb36702.patch > > > See the example below. The input data is a long, but ints are provided to > match against. Because it uses the Java equals to compare and these are > different types, this will never match, which can lead to confusing results. > I believe it should at least throw an error. > {code} > define I datafu.pig.util.InUDF(); > > data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)}); > > data2 = FOREACH data { > C = FILTER B By I(v, 1,2,3); > GENERATE C; > } > > describe data2; > > STORE data2 INTO 'output'; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161118#comment-16161118 ] Eyal Allweil commented on DATAFU-61: Came back to this today and tried a little experiment - I verified (calculating manually) that Russell's code produces the same results as the "augmented TF" IDF flavor for the sample I took from the Wikipedia page. Is that good enough for us? > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-61: --- Attachment: DATAFU-61-2.patch Now that macros are supported (and can be tested), I updated this patch. Unfortunately, I couldn't find the sample data, so I just pulled the sample sentences from the Wikipedia page for TF-IDF, and I didn't verify that the results are OK. [~russell.jurney] - want to donate a test case and expected results? > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115744#comment-16115744 ] Eyal Allweil commented on DATAFU-119: - [~matterhayes] - We want the Apache license header on our macro files too, right? If so, I'll add it to the sample macro from [DATAFU-123|https://issues.apache.org/jira/browse/DATAFU-123] as well. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
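The behavior the description outlines — position-based comparison, "added"/"missing" when only one side of the full outer join has a row, ignorable fields, and "changed fieldN" output — can be approximated in a few lines. A rough Python analogue (the helper name and the flat-tuple simplification are mine; the real UDF also drills into nested tuples and uses schema aliases when available):

```python
def tuple_diff(old, new, ignored=()):
    """Compare two equal-arity tuples by position. Return 'added' or
    'missing' when one side is absent (a one-sided full-outer-join row),
    a 'changed fieldN ...' summary when values differ, or None when the
    tuples match. Nested drill-down and field names are omitted here."""
    if old is None and new is not None:
        return "added"
    if new is None and old is not None:
        return "missing"
    changed = [i for i, (a, b) in enumerate(zip(old, new))
               if i not in ignored and a != b]
    return "changed " + " ".join(f"field{i}" for i in changed) if changed else None

assert tuple_diff(None, (1, 2)) == "added"
assert tuple_diff((1, 2, 3), (1, 9, 3)) == "changed field1"
# An ignored field (e.g. a creation timestamp) does not count as a diff.
assert tuple_diff((1, 2), (1, 5), ignored=(1,)) is None
```

As in the macro, most rows compare equal and yield null, so filtering on a non-null diff leaves only the edge cases worth inspecting.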
[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods
[ https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16067884#comment-16067884 ] Eyal Allweil commented on DATAFU-124: - I reviewed it - looks fine, a nice improvement. I'll try to get it committed soon (unless of course someone has any actionable comments) > sessionize() ought to support millisecond periods > - > > Key: DATAFU-124 > URL: https://issues.apache.org/jira/browse/DATAFU-124 > Project: DataFu > Issue Type: Bug >Reporter: Jacob Tolar > > The sessionize UDF should support a period in milliseconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
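For context, supporting a millisecond period just means the session gap threshold is expressed in ms rather than whole seconds. A minimal sketch of the sessionization semantics (this is an illustration of the behavior, not DataFu's Sessionize code):

```python
def sessionize(timestamps_ms, period_ms):
    """Assign a session id to each timestamp in a time-sorted stream:
    a new session starts whenever the gap since the previous event
    exceeds period_ms."""
    session, last = 0, None
    out = []
    for t in timestamps_ms:
        if last is not None and t - last > period_ms:
            session += 1
        out.append((t, session))
        last = t
    return out

events = [0, 300, 900, 5000, 5200]  # milliseconds
# With a 1000 ms period, the 4100 ms gap before 5000 starts session 1.
assert sessionize(events, 1000) == [(0, 0), (300, 0), (900, 0), (5000, 1), (5200, 1)]
```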
[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL
[ https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972397#comment-15972397 ] Eyal Allweil commented on DATAFU-12: It looks like this functionality is implemented in Hive - see the following two links: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-LEADusingdefault1rowleadandnotspecifyingdefaultvalue https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLead.java Since Pig now supports using Hive UDFs, I think this Jira can be closed. Alternatively, if we want to provide a DataFu implementation, I'll copy the proposed patch and discussion from the GitHub issue mentioned in the description, so it's easier for a potential implementer to continue where the work stalled. > Implement Lead UDF based on version from SQL > > > Key: DATAFU-12 > URL: https://issues.apache.org/jira/browse/DATAFU-12 > Project: DataFu > Issue Type: New Feature >Reporter: Matthew Hayes > > Min Zhou has provided this suggestion ([Issue #88 on > GitHub|https://github.com/linkedin/datafu/pull/88]): > Lead is an analytic function like Oracle's Lead function. It provides access > to more than one tuple of a bag at the same time without a self join. Given a > bag of tuples returned from a query, LEAD provides access to a tuple at a > given physical offset beyond that position. Generates pairs of all items in a > bag. > If you do not specify an offset, then its default is 1. Null is returned if the > offset goes beyond the scope of the bag. 
> Example 1: > {noformat} >register ba-pig-0.1.jar >define Lead datafu.pig.bags.Lead('2'); >-- INPUT: ({(1),(2),(3),(4)}) >data = LOAD 'input' AS (data: bag {T: tuple(v:INT)}); >describe data; >-- OUTPUT: ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)}) >-- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: > int),elem2: (v: int))}} >data2 = FOREACH data GENERATE Lead(data); >describe data2; >DUMP data2; > {noformat} > Example 2 > {noformat} >register ba-pig-0.1.jar >define Lead datafu.pig.bags.Lead(); >-- INPUT: > ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})}) >data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: > tuple(v2:INT)})}); >--describe data; >-- OUPUT: > ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)}) >data2 = FOREACH data GENERATE Lead(data); >--describe data2; >DUMP data2; > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
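Example 1's expected output can be reproduced with a short sketch of the proposed semantics: each tuple is paired with the tuples at the next `offset` positions, padded with null past the end of the bag (a Python stand-in for the suggested Java UDF):

```python
def lead(bag, offset=1):
    """For each tuple in the bag, emit a tuple pairing it with the next
    `offset` tuples, using None where the offset runs past the end."""
    n = len(bag)
    return [tuple(bag[i + j] if i + j < n else None for j in range(offset + 1))
            for i in range(n)]

# Lead('2') over {(1),(2),(3),(4)} -- compare with Example 1's OUTPUT line,
# where trailing commas mark the null-padded positions.
assert lead([(1,), (2,), (3,), (4,)], offset=2) == [
    ((1,), (2,), (3,)),
    ((2,), (3,), (4,)),
    ((3,), (4,), None),
    ((4,), None, None),
]
```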
[jira] [Updated] (DATAFU-123) Allow DataFu to include macros
[ https://issues.apache.org/jira/browse/DATAFU-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-123: Attachment: DATAFU-123.patch The change ended up being smaller than what I originally described - all I did was add the "pig.import.search.path" property with the value of the src/main/resources directory to PigTests. This means that any macro files that are put there can be tested, both in Gradle and Eclipse. I put some sample counting macros there and a test for them. In general, any macro file placed in src/main/resources can be used by registering the DataFu jar. If we include this patch, we should update the Contributing page so that instructions for contributing Pig macros are easy to find and understand. > Allow DataFu to include macros > --- > > Key: DATAFU-123 > URL: https://issues.apache.org/jira/browse/DATAFU-123 > Project: DataFu > Issue Type: Improvement >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Labels: testability > Attachments: DATAFU-123.patch > > > A few changes to allow macros to be contributed to DataFu. If a macro file is > placed in src/main/resources, it can be used by registering the DataFu jar. > Such macros can then be tested both from within Eclipse and Gradle. > There are three small parts: > 1) All unit tests that use createPigTest methods will automatically register > the DataFu jar. > 2) Some changes to fix the PigTests.getJarPath() functionality, which doesn't > appear to work. (these changes are aligned with the proposed patch for > [DATAFU-106|https://issues.apache.org/jira/browse/DATAFU-106]) > 3) A sample macro and test > The changes here will allow moving forward with > [DATAFU-61|https://issues.apache.org/jira/browse/DATAFU-61] and including the > macro I suggested for > [DATAFU-119|https://issues.apache.org/jira/browse/DATAFU-119]. (I also have > additional content in mind) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DATAFU-106) Test files should be created in a subfolder of projects
[ https://issues.apache.org/jira/browse/DATAFU-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817823#comment-15817823 ] Eyal Allweil commented on DATAFU-106: - [~takias], I will try to sort our Jira issues out and mark those that are easier to begin with. Have you worked on Pig UDFs before? Piyush - I will try to finish our review as soon as I can! > Test files should be created in a subfolder of projects > --- > > Key: DATAFU-106 > URL: https://issues.apache.org/jira/browse/DATAFU-106 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > Fix For: 1.3.1 > > > Test files are currently created in the project folder (e.g. > datafu-pig/input*). For better organization, we should create them in a > subdirectory. This also makes it easier to exclude them all with gitignore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15793097#comment-15793097 ] Eyal Allweil commented on DATAFU-119: - If we add DATAFU-123, we can include the macro I put in the description so that people can use it instead of duplicating it in order to conveniently call the UDF. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-87) Edit distance
[ https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606106#comment-15606106 ] Eyal Allweil commented on DATAFU-87: Hi Joydeep, I want to begin by apologizing for the time it's taken us to get to your contribution. Did you ever continue with it? Have you compared your implementation with [the one in Apache Commons Text|https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java] or [Commons Lang|https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L7731]? (I think they follow the same algorithm, from _Algorithms on Strings, Trees and Sequences_ by Dan Gusfield, as implemented by Chas Emerick) > Edit distance > - > > Key: DATAFU-87 > URL: https://issues.apache.org/jira/browse/DATAFU-87 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Joydeep Banerjee > Attachments: DATAFU-87.patch > > > [This is work-in-progress] > Given 2 strings, provide a measure of dissimilarity (Levenshtein distance) > between them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
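For comparison with the linked implementations, the standard two-row dynamic-programming form of Levenshtein distance looks like this (a generic sketch of the algorithm, not the code in the attached patch):

```python
def levenshtein(s, t):
    """Edit distance via two-row dynamic programming: O(len(s)*len(t))
    time, O(len(t)) memory -- the same memory-saving shape used by the
    Commons Text/Lang implementations linked above."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
assert levenshtein("", "abc") == 3
```

When evaluating a contributed patch against these, the things to check are the DP recurrence, the memory footprint, and null/empty-string handling.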
[jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting
[ https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605952#comment-15605952 ] Eyal Allweil commented on DATAFU-98: Hi Russell. First of all, I want to apologize for the time it's taken us to get to your contribution. I think it could be quite useful. Having said that, I wonder if the current version - without counters - gives us enough of an advantage over vanilla Pig. I think the following code (modified from your unit test) gives us nearly the same functionality as the UDF in the patch: {noformat} data_in = LOAD 'input' as (val:int); -- data_in: "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "20" intermediate_data = FOREACH data_in GENERATE val, (val / 5 * 5) AS binStart; data_out = FOREACH (GROUP intermediate_data BY binStart) GENERATE group AS binStart, COUNT(intermediate_data) AS binCount; -- data_out: (0,5),(5,5),(10,2),(20,1) {noformat} Unlike your UDF, missing bins are not included. But while including missing bins can be useful, I do wonder if a single skewed value can cause problems, especially with small bin sizes and long values. (as a performance-related aside, I would try to have FrequencyCounter.toBag() called only in the Final implementations, instead of the first two stages of the algebraic implementation, to minimize the data copied). So it seems to me the current UDF has the advantage of having the missing bins, and it's obviously more readable and convenient than rewriting the Pig code I wrote above. Did you (or you, [~andrew.musselman]) run any performance tests? Maybe the Algebraic implementation runs faster than the vanilla Pig code by virtue of the combiner use. Last (but not least!) the version you mentioned with counters sounds like it could be really great. 
> New UDF for Histogram / Frequency counting > -- > > Key: DATAFU-98 > URL: https://issues.apache.org/jira/browse/DATAFU-98 > Project: DataFu > Issue Type: New Feature >Reporter: Russell Melick > Attachments: DATAFU-98.patch > > > I was thinking of creating a new UDF to compute histograms / frequency counts > of input bags. It seems like it would make sense to support ints, longs, > float, and doubles. > I tried looking around to see if this was already implemented, but > ValueHistogram and AggregateWordHistogram were about the only things I found. > They seem to exist as an example job, and only work for Strings. > https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html > https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html > Should the user specify the bin size or the number of bins? Specifying bin > size probably makes the implementation simpler since you can bin things > without having seen all of the data. > I think it would make sense to implement a version of this that didn't need > any reducers. It could use counters to keep track of the counts per bin > without sending any data to a reducer. You would be able to call this > without a preceding GROUP BY as well. > Here's my proposal for the two udfs. This assumes the input data is two > columns, memberId and numConnections. 
> {code} > DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50') > connections = LOAD 'connections' AS memberId, numConnections; > connectionHistogram = FOREACH (GROUP connections ALL) GENERATE > BinnedFrequency(connections.numConnections); > {code} > The output here would be a bag with the frequency counts > {code} > {('0-49', 5), ('50-99', 0), ('100-149', 10)} > {code} > {code} > DEFINE BinnedFrequencyCounter > datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram') > connections = LOAD 'connections' AS memberId, numConnections; > connections = FOREACH connections GENERATE > BinnedFrequencyCounter(numConnections); > {code} > The output here would just be a counter for each bin, all sharing the same > group of numConnectionsHistogram. It would look something like > numConnectionsHistogram.'0-49' = 5 > numConnectionsHistogram.'50-99' = 0 > numConnectionsHistogram.'100-149' = 10 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
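The difference called out in the comment — the proposed UDF emits empty bins, while the plain-Pig GROUP BY rewrite drops them — is easy to see in a small sketch (the function name and bin layout are illustrative, not the proposed BinnedFrequency API; non-empty input is assumed):

```python
from collections import Counter

def binned_frequency(values, bin_size, min_value=0):
    """Count values per fixed-size bin and, unlike a plain GROUP BY,
    emit zero-count bins between min_value and the highest occupied bin."""
    counts = Counter((v - min_value) // bin_size for v in values)
    top = max(counts)  # highest occupied bin index
    return [(min_value + b * bin_size, counts.get(b, 0)) for b in range(top + 1)]

vals = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20]
# Bin size 5: the empty (15, 0) bin appears here but not in the
# GROUP BY version from the comment, whose output is (0,5),(5,5),(10,2),(20,1).
assert binned_frequency(vals, 5) == [(0, 5), (5, 5), (10, 2), (15, 0), (20, 1)]
```

This also makes the skew concern concrete: a single outlier value forces the emission of every empty bin up to it, which is where small bin sizes plus long values could hurt.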
[jira] [Comment Edited] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15589407#comment-15589407 ] Eyal Allweil edited comment on DATAFU-25 at 10/19/16 6:01 PM: -- This is a minimal fix that uses getInputSchema() instead of the udf context, when possible. It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], though not the example provided there - BagLeftOuterJoin. This is because MonitoredUDF prevents the udf context from working (see [PIG-3554|https://issues.apache.org/jira/browse/PIG-3554]), and BagJoin uses the udf context for other things, not just those that AliasableEvalFunc provides. I couldn't think of a clean way of adding a test for this, but you can verify that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - this will make the TransposeTest.transposeTest fail, unless my patch is used. I didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding it in the test package means that TransposeTest can't access it. was (Author: eyal): This is a minimal fix that uses getInputSchema() instead of the udf context, when possible. It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], though not the example provided there - BagLeftOuterJoin. This is because MonitoredUDF prevents the udf context from working (see [PIG-3554|https://issues.apache.org/jira/browse/PIG-3554], and BagJoin uses the udf context for other things, not just those that AliasableEvalFunc provides. I couldn't think of a clean way of adding a test for this, but you can verify that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - this will make the TransposeTest.transposeTest fail, unless my patch is used. I didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding it in the test package means that TransposeTest can't access it. 
> AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Will Vaughan > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-25: --- Attachment: DATAFU-25.patch This is a minimal fix that uses getInputSchema() instead of the udf context, when possible. It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], though not the example provided there - BagLeftOuterJoin. This is because MonitoredUDF prevents the udf context from working (see [PIG-3554|https://issues.apache.org/jira/browse/PIG-3554], and BagJoin uses the udf context for other things, not just those that AliasableEvalFunc provides. I couldn't think of a clean way of adding a test for this, but you can verify that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - this will make the TransposeTest.transposeTest fail, unless my patch is used. I didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding it in the test package means that TransposeTest can't access it. > AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Will Vaughan > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586085#comment-15586085 ] Eyal Allweil commented on DATAFU-16: It looks like this got added - can this issue be closed? > weighted reservoir sampling with exponential jumps UDF > -- > > Key: DATAFU-16 > URL: https://issues.apache.org/jira/browse/DATAFU-16 > Project: DataFu > Issue Type: New Feature > Environment: Mac, Linux > pig-0.11 >Reporter: jian wang >Assignee: jian wang >Priority: Minor > Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, > WeightedSamplingCorrectnessTests.java > > > Create a weightedReservoirSampleWithExpJump UDF to implement the weighted > reservoir sampling algorithm with exponential jumps. Investigation is tracked > in https://github.com/linkedin/datafu/issues/80. This task is part of > experiment of different weighted sampling algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
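For readers landing on this issue, the algorithm in question is Efraimidis and Spirakis's weighted reservoir sampling with exponential jumps (A-ExpJ): each item conceptually gets the key u**(1/weight) and the k largest keys win, but instead of drawing a key per item, the sampler "jumps" over a weight-span of items it would reject anyway. A compact Python sketch of that idea (a generic illustration, not the attached ScoredExpJmpReservoir.java):

```python
import heapq
import math
import random

def weighted_reservoir_expj(stream, k, rng=random.random):
    """A-ExpJ weighted reservoir sampling. `stream` yields (item, weight)
    pairs with weight > 0; returns k sampled items (or all, if fewer).
    The min-heap keeps the k largest keys seen so far."""
    it = iter(stream)
    heap = []  # min-heap of (key, item)
    for item, w in it:
        heapq.heappush(heap, (rng() ** (1.0 / w), item))
        if len(heap) == k:
            break
    if len(heap) < k:
        return [item for _, item in heap]
    # Exponential jump: total weight to skip before the next replacement.
    x = math.log(rng()) / math.log(heap[0][0])
    for item, w in it:
        x -= w
        if x <= 0:
            t = heap[0][0] ** w
            key = (t + (1 - t) * rng()) ** (1.0 / w)  # uniform in (t, 1)
            heapq.heapreplace(heap, (key, item))
            x = math.log(rng()) / math.log(heap[0][0])
    return [item for _, item in heap]

random.seed(7)
sample = weighted_reservoir_expj([(i, 1.0 + i) for i in range(1000)], 10)
assert len(sample) == 10 and len(set(sample)) == 10
```

The jump is the whole point of the variant: the number of random draws scales with the sample size and skips, not with the stream length.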
[jira] [Commented] (DATAFU-45) RFE: CartesianProduct
[ https://issues.apache.org/jira/browse/DATAFU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584898#comment-15584898 ] Eyal Allweil commented on DATAFU-45: Hi Sam, Did you ever solve this? I agree with Matthew that this should be doable via plain Pig - if not, I'd open a bug there. > RFE: CartesianProduct > - > > Key: DATAFU-45 > URL: https://issues.apache.org/jira/browse/DATAFU-45 > Project: DataFu > Issue Type: New Feature >Reporter: Sam Steingold > > Given two bags, produce their [Cartesian > product|http://en.wikipedia.org/wiki/Cartesian_product]: > {code} > B1: bag{T1} > B2: bag{T2} > CartesianProduct(B1,B2): bag{(T1,T2)} > {code} > Use case: > {code} > toks = TOKENIZE((charray)$0,','); > kwds = CartesianProduct(toks, {1.0/(double)SIZE(toks)}); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
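For reference, the requested semantics are just a cross of the two bags; a minimal sketch follows. (In Pig itself, FLATTENing both bags in a single FOREACH ... GENERATE produces the same cross, which is why plain Pig is suggested in the comment.)

```python
from itertools import product

def cartesian_product(bag1, bag2):
    """Cross two bags of tuples, returning bag{(T1, T2)} as the issue asks."""
    return [(t1, t2) for t1, t2 in product(bag1, bag2)]

b1 = [("a",), ("b",)]
b2 = [(1,), (2,), (3,)]
assert len(cartesian_product(b1, b2)) == 6  # 2 x 3 pairs
assert cartesian_product(b1, b2)[0] == (("a",), (1,))
```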
[jira] [Updated] (DATAFU-65) Aho-Corasick Pig UDF
[ https://issues.apache.org/jira/browse/DATAFU-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-65: --- Issue Type: New Feature (was: Bug) > Aho-Corasick Pig UDF > > > Key: DATAFU-65 > URL: https://issues.apache.org/jira/browse/DATAFU-65 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 > Environment: Drought >Reporter: Russell Jurney > Attachments: DATAFU-65.diff > > Original Estimate: 8h > Remaining Estimate: 8h > > I need to use the Aho-Corasick algorithm for efficient sub-string matching. A > java implementation is available at > https://github.com/robert-bor/aho-corasick and is available on maven central: > http://maven-repository.com/artifact/org.arabidopsis.ahocorasick/ahocorasick/2.x > A Pig UDF will be very helpful to me. > How do I add a maven dependency with gradle? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-28) Tests are too slow
[ https://issues.apache.org/jira/browse/DATAFU-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571961#comment-15571961 ] Eyal Allweil commented on DATAFU-28: On my machine the datafu-pig tests run in 18 minutes (I ran them with ./gradlew :datafu-pig:test). Is this issue still relevant, or is that an acceptable time? > Tests are too slow > -- > > Key: DATAFU-28 > URL: https://issues.apache.org/jira/browse/DATAFU-28 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes > > I ran the tests on my laptop and it took nearly 2 hours. > The worst offenders are {{datafu.test.pig.sampling}}, > {{datafu.test.pig.stats}}, and {{datafu.test.pig.stats.entropy}}. > ||Package ||Tests|| Failures|| Duration|| Success rate|| > |datafu.test.pig.bags|27 |0| 1m10.72s|100%| > |datafu.test.pig.geo |1 |0 |9.757s |100%| > |datafu.test.pig.hash|4 |0 |41.039s| 100%| > |datafu.test.pig.linkanalysis|5 |0| 32.677s |100%| > |datafu.test.pig.random |1| 0| 11.789s|100%| > |datafu.test.pig.sampling |25|0 |38m25.81s| 100%| > |datafu.test.pig.sessions |7 |0 |2m50.67s |100%| > |datafu.test.pig.sets |9 |0 |5m46.70s |100%| > |datafu.test.pig.stats| 52| 0 |26m11.98s| 100%| > |datafu.test.pig.stats.entropy|40|0 |31m30.97s |100%| > |datafu.test.pig.urls|1 |0 |1m35.24s |100%| > |datafu.test.pig.util|21 |0| 4m51.64s|100%| -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-85) Add SPRINTF to provide this functionality to Pig < 0.14.0
[ https://issues.apache.org/jira/browse/DATAFU-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571787#comment-15571787 ] Eyal Allweil commented on DATAFU-85: Given the time that has passed, and that it can't be backported (easily), I think this issue can/should be closed. > Add SPRINTF to provide this functionality to Pig < 0.14.0 > - > > Key: DATAFU-85 > URL: https://issues.apache.org/jira/browse/DATAFU-85 > Project: DataFu > Issue Type: Bug >Reporter: Russell Jurney >Assignee: Russell Jurney > > I need SPRINTF in DataFu for a book I'm working on. I'd like to add this to > DataFu so that CDH, HDP, MapR, etc. users can use SPRINTF as soon as DataFu > cuts a new release. > See PIG-3939 > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-122) Documentation error/typo on tips and tricks involving Coalesce
[ https://issues.apache.org/jira/browse/DATAFU-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-122: Assignee: Eyal Allweil Labels: documentation typo (was: docuentation typo) Fix Version/s: 1.3.2 Thanks Ryan! I've fixed this in our sources, and it will show up when we release our next version (probably 1.3.2) > Documentation error/typo on tips and tricks involving Coalesce > -- > > Key: DATAFU-122 > URL: https://issues.apache.org/jira/browse/DATAFU-122 > Project: DataFu > Issue Type: Bug >Reporter: Ryan Clough >Assignee: Eyal Allweil >Priority: Trivial > Labels: documentation, typo > Fix For: 1.3.2 > > > http://datafu.incubator.apache.org/docs/datafu/guide/more-tips-and-tricks.html > On this page, an example is given for Coalesce: > {code} > DEFINE EmptyBagToNullFields datafu.pig.util.Coalesce(); > data = FOREACH data GENERATE Coalesce(val,0) as result; > {code} > In this example, "EmpyBagToNullFields" should be replaced with "Coalesce", > which is what is used in the code following the define statement. My guess is > this is a copy paste error from an example further down when > EmpyBagToNullFields is actually used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15500764#comment-15500764 ] Eyal Allweil commented on DATAFU-119: - I've run it on results that were in the tens of millions. I think the main reason for using it / including it in DataFu is that if you're developing Pig code, and running it on a cluster (or on any given environment), being able to stay in the Pig ecosystem is convenient for fast development cycles. If your original job can run on the given environment, a comparison job can run there efficiently, too. And there's less copying because you leave the previous results in HDFS under a different name, and compare easily. The output is human-readable, but the expected result is that most records return null, because they're identical; the ones that do come out are usually edge cases that turned out different. That's the reasoning behind having "something" like this UDF. The output type and its lack of a schema are a different story - it would be better to have a schema. But I'm hesitant to spend the time on it if it isn't likely that someone else will want to write a different output format for it. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. 
> Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. > We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15471164#comment-15471164 ] Eyal Allweil commented on DATAFU-119: - Any feedback about this? > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15350489#comment-15350489 ] Eyal Allweil commented on DATAFU-119: - I put up a [reviewboard|https://reviews.apache.org/r/49248/] for this. After some internal discussions, I wonder if the output isn't too specific for general use - I find it very convenient during development for comparing outputs, but it's very much skewed towards human-readability - to make it easy to use the output in Pig, it should have a real schema, not chararray - possibly something with the field names from the original tuples, but boolean or int values to indicate change types. I'd be happy to hear feedback about this. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-117: Attachment: DATAFU-117-4.patch This patch incorporates the last remaining comment from the review (clearing instead of reassigning the set in cleanup) > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, > DATAFU-117-4.patch, DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-117: Attachment: DATAFU-117-3.patch Incorporates changes from [review |https://reviews.apache.org/r/46701/] > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258239#comment-15258239 ] Eyal Allweil edited comment on DATAFU-117 at 5/9/16 8:50 AM: - Ok, I opened a review board for it - It's at https://reviews.apache.org/r/46701/ I think all your previous comments are addressed there, except for the one about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this can exceed the max size, because a single add operation can only increment the set's size by one, and the UDF is executed in a single thread. I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT followed by the builtin COUNT. On small inputs they perform about the same - even up to a million records - but if you have a situation with more skew (I checked 10 million records, with about 4 million distincts) then this UDF with a max value of say, 1,000,000, runs in a few minutes, and the nested foreach+distinct+count takes more than an hour - probably because it needs to keep all the distinct records in memory, rather than just reaching the desired threshold. was (Author: eyal): Ok, I opened a review board for it - can you see it? It's at https://reviews.apache.org/r/46701/ I think all your previous comments are addressed there, except for the one about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this can exceed the max size, because a single add operation can only increment the set's size by one, and the UDF is executed in a single thread. I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT followed by the builtin COUNT. 
On small inputs they perform about the same - even up to a million records - but if you have a situation with more skew (I checked 10 million records, with about 4 million distincts) then this UDF with a max value of say, 1, runs in about four minutes, and the nested foreach+distinct+count takes more than an hour - probably because it needs to keep all the distinct records in memory, rather than just reaching the desired threshold. > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117-2.patch, DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
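The short-circuit condition quoted above, "this.set.add(o) && (this.set.size() == maxAmount", is the core of the UDF and explains both the correctness argument (a single add grows the set by at most one, so the size cannot overshoot) and the performance win on skewed data (scanning stops at the limit). A minimal, Pig-free sketch of that logic, with illustrative names:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the CountDistinctUpTo idea: count distinct items in a stream,
// stopping as soon as a preset limit is reached.
public class CountDistinctUpToSketch {
    public static int countDistinctUpTo(Iterable<?> items, int max) {
        Set<Object> seen = new HashSet<>();
        for (Object o : items) {
            // add() returns true only for a previously unseen element, and can
            // grow the set by at most one, so size() can never exceed max here.
            if (seen.add(o) && seen.size() == max) {
                return max; // limit reached; skip the rest of the bag
            }
        }
        return seen.size();
    }
}
```

This is why the UDF beats a nested FOREACH + DISTINCT + COUNT on skewed bags: the latter must hold every distinct record, while this stops at the threshold.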
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-117: Attachment: DATAFU-117-2.patch This replaces the previous patch file, addresses (most of) Matthew's comments, and adds an Algebraic implementation to the UDF. > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117-2.patch, DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215634#comment-15215634 ] Eyal Allweil commented on DATAFU-115: - Thanks! > Make TupleFromBag implement Accumulator > --- > > Key: DATAFU-115 > URL: https://issues.apache.org/jira/browse/DATAFU-115 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil >Assignee: Eyal Allweil >Priority: Minor > Labels: performance > Fix For: 1.3.1 > > Attachments: DATAFU-115.patch > > > Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. > TupleFromBag doesn't need to hold the bag in memory, and can iterate through > it until it reaches the desired tuple. By implementing Accumulator, larger > bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213559#comment-15213559 ] Eyal Allweil commented on DATAFU-115: - I'm not sure why, but I can't see this patch in the master branch. I can see https://issues.apache.org/jira/browse/DATAFU-114 - [FirstTupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/FirstTupleFromBag.java] appears changed - but [TupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/TupleFromBag.java] looks like it hasn't been changed since August. Does the public GitHub represent the repository accurately? > Make TupleFromBag implement Accumulator > --- > > Key: DATAFU-115 > URL: https://issues.apache.org/jira/browse/DATAFU-115 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil >Assignee: Eyal Allweil >Priority: Minor > Labels: performance > Fix For: 1.3.1 > > Attachments: DATAFU-115.patch > > > Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. > TupleFromBag doesn't need to hold the bag in memory, and can iterate through > it until it reaches the desired tuple. By implementing Accumulator, larger > bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-117: Attachment: DATAFU-117.patch Patch including new UDF and test (in BagTests) > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DATAFU-117) New UDF - CountDistinctUpTo
Eyal Allweil created DATAFU-117: --- Summary: New UDF - CountDistinctUpTo Key: DATAFU-117 URL: https://issues.apache.org/jira/browse/DATAFU-117 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil A UDF that counts distinct tuples within a bag, but only up to a preset limit. If the bag contains more distinct tuples than the limit, the UDF returns the limit. This UDF can run reasonably well even on large bags if the limit chosen is small enough though the count is done in memory. We use this UDF in PayPal for filtering, when we don't need to use the actual tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185409#comment-15185409 ] Eyal Allweil commented on DATAFU-116: - As far as I can tell, when the accumulator is used, Pig passes _pig.accumulative.batchsize_ tuples from each bag until all the tuples are exhausted. I think an implementation that iterates over the bags and only keeps some of the tuples in between batches is possible - hopefully very few, but the worst case is all of them, which is no worse than the current implementation. I'm assuming Pig passes batches in this way based on the code in [POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java] and from looking through all the documentation I could find on accumulators. If I'm wrong it does mean that an accumulator implementation isn't worthwhile. > Make SetIntersect and SetDifference implement Accumulator > - > > Key: DATAFU-116 > URL: https://issues.apache.org/jira/browse/DATAFU-116 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil > > SetIntersect and SetDifference accept only sorted bags, and the output is > always smaller than the inputs. Therefore an accumulator implementation > should be possible and it will improve memory usage (somewhat) and allow Pig > to optimize loops with these operations better. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
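The issue notes that SetIntersect accepts only sorted bags, which is what makes a low-memory implementation conceivable: two sorted inputs can be intersected in one forward pass, holding only the current element of each side. The sketch below shows just that merge step for two sorted lists (names are illustrative); the open accumulator question above - how the tuples arrive in batches - is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of intersecting two sorted inputs with a single forward pass,
// the property that makes SetIntersect's memory usage manageable.
public class SortedIntersectSketch {
    public static <T extends Comparable<T>> List<T> intersect(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {        // present in both: emit and advance both sides
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {  // a is behind: advance a
                i++;
            } else {               // b is behind: advance b
                j++;
            }
        }
        return out;
    }
}
```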
[jira] [Created] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
Eyal Allweil created DATAFU-116: --- Summary: Make SetIntersect and SetDifference implement Accumulator Key: DATAFU-116 URL: https://issues.apache.org/jira/browse/DATAFU-116 Project: DataFu Issue Type: Improvement Affects Versions: 1.3.0 Reporter: Eyal Allweil SetIntersect and SetDifference accept only sorted bags, and the output is always smaller than the inputs. Therefore an accumulator implementation should be possible and it will improve memory usage (somewhat) and allow Pig to optimize loops with these operations better. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-115: Flags: Patch > Make TupleFromBag implement Accumulator > --- > > Key: DATAFU-115 > URL: https://issues.apache.org/jira/browse/DATAFU-115 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil >Priority: Minor > Labels: performance > Fix For: 1.3.1 > > Attachments: DATAFU-115.patch > > > Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. > TupleFromBag doesn't need to hold the bag in memory, and can iterate through > it until it reaches the desired tuple. By implementing Accumulator, larger > bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-115: Attachment: DATAFU-115.patch Relatively straightforward patch ... there's one difference from the previous behavior, that if an exception is thrown, I ignore it and try to continue iterating to the desired index. I tried uploading it to the review board, see if [this link|https://reviews.apache.org/r/44351/] works. > Make TupleFromBag implement Accumulator > --- > > Key: DATAFU-115 > URL: https://issues.apache.org/jira/browse/DATAFU-115 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil >Priority: Minor > Labels: performance > Fix For: 1.3.1 > > Attachments: DATAFU-115.patch > > > Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. > TupleFromBag doesn't need to hold the bag in memory, and can iterate through > it until it reaches the desired tuple. By implementing Accumulator, larger > bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
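The accumulator approach described in the issue - iterate through the bag, remembering only the tuple at the desired index instead of materializing the whole bag - follows the accumulate/getValue/cleanup contract of Pig's Accumulator interface. A Pig-free sketch of that shape, with hypothetical names (the real UDF implements org.apache.pig.Accumulator and works on DataBag chunks):

```java
import java.util.List;

// Sketch of an accumulator-style TupleFromBag: accumulate() may be called
// many times with successive chunks of the input; only a running position
// and the element at the target index are kept in memory.
public class TupleFromBagSketch {
    private final int targetIndex;
    private long position = 0;
    private Object result = null;

    public TupleFromBagSketch(int targetIndex) {
        this.targetIndex = targetIndex;
    }

    public void accumulate(List<?> chunk) {
        for (Object o : chunk) {
            if (position == targetIndex) {
                result = o; // found the desired element; later chunks are ignored
            }
            position++;
        }
    }

    public Object getValue() {
        return result;
    }

    public void cleanup() { // reset between input groups
        position = 0;
        result = null;
    }
}
```

FirstTupleFromBag (DATAFU-114) is the degenerate case of the same pattern with a target index of zero.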
[jira] [Created] (DATAFU-115) Make TupleFromBag implement Accumulator
Eyal Allweil created DATAFU-115: --- Summary: Make TupleFromBag implement Accumulator Key: DATAFU-115 URL: https://issues.apache.org/jira/browse/DATAFU-115 Project: DataFu Issue Type: Improvement Affects Versions: 1.3.0 Reporter: Eyal Allweil Priority: Minor Fix For: 1.3.1 Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. TupleFromBag doesn't need to hold the bag in memory, and can iterate through it until it reaches the desired tuple. By implementing Accumulator, larger bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150312#comment-15150312 ] Eyal Allweil commented on DATAFU-114: - Thanks! After I imported the projects individually, like you suggested, it works fine in Eclipse ... I suggest adding a sentence about it in the base readme file to help out future contributors > Make FirstTupleFromBag implement Accumulator > > > Key: DATAFU-114 > URL: https://issues.apache.org/jira/browse/DATAFU-114 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 > Environment: All >Reporter: Eyal Allweil >Assignee: Eyal Allweil >Priority: Minor > Labels: easyfix, newbie, performance > Fix For: 1.3.1 > > Attachments: FirstTupleFromBag.java > > > FirstTupleFromBag only needs the first tuple from the bag, but because it > doesn't implement Accumulator the entire bag needs to be passed to it > in-memory. The fix is very minor and will make the UDF support large bags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15131991#comment-15131991 ] Eyal Allweil commented on DATAFU-114: - Anyone? > Make FirstTupleFromBag implement Accumulator > > > Key: DATAFU-114 > URL: https://issues.apache.org/jira/browse/DATAFU-114 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 > Environment: All >Reporter: Eyal Allweil >Priority: Minor > Labels: easyfix, newbie, performance > Attachments: FirstTupleFromBag.java > > > FirstTupleFromBag only needs the first tuple from the bag, but because it > doesn't implement Accumulator the entire bag needs to be passed to it > in-memory. The fix is very minor and will make the UDF support large bags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114990#comment-15114990 ] Eyal Allweil commented on DATAFU-114: - Any comments? Can this patch be pulled? > Make FirstTupleFromBag implement Accumulator > > > Key: DATAFU-114 > URL: https://issues.apache.org/jira/browse/DATAFU-114 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 > Environment: All >Reporter: Eyal Allweil >Priority: Minor > Labels: easyfix, newbie, performance > Attachments: FirstTupleFromBag.java > > > FirstTupleFromBag only needs the first tuple from the bag, but because it > doesn't implement Accumulator the entire bag needs to be passed to it > in-memory. The fix is very minor and will make the UDF support large bags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
Eyal Allweil created DATAFU-114: --- Summary: Make FirstTupleFromBag implement Accumulator Key: DATAFU-114 URL: https://issues.apache.org/jira/browse/DATAFU-114 Project: DataFu Issue Type: Improvement Affects Versions: 1.3.0 Environment: All Reporter: Eyal Allweil Priority: Minor FirstTupleFromBag only needs the first tuple from the bag, but because it doesn't implement Accumulator the entire bag needs to be passed to it in-memory. The fix is very minor and will make the UDF support large bags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-114: Attachment: FirstTupleFromBag.java I wasn't able to test this patch because I can't get the build working on my system (Ubuntu LTS) ... I'm getting the error described [here|https://issues.apache.org/jira/browse/DATAFU-95]. I can't seem to make Gradle use a different Java to get it to compile. However, since the implementation of Accumulator is relatively straightforward, I hopefully haven't made any mistakes, and I would appreciate it if someone whose build is working tried it out and pulled the patch. > Make FirstTupleFromBag implement Accumulator > > > Key: DATAFU-114 > URL: https://issues.apache.org/jira/browse/DATAFU-114 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 > Environment: All >Reporter: Eyal Allweil >Priority: Minor > Labels: easyfix, newbie, performance > Attachments: FirstTupleFromBag.java > > > FirstTupleFromBag only needs the first tuple from the bag, but because it > doesn't implement Accumulator the entire bag needs to be passed to it > in-memory. The fix is very minor and will make the UDF support large bags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-95) Improve wrong JDK error message
[ https://issues.apache.org/jira/browse/DATAFU-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097858#comment-15097858 ] Eyal Allweil commented on DATAFU-95: As an immediate, easy-to-do improvement, writing what Java version is required in the main README on GitHub would be great. > Improve wrong JDK error message > --- > > Key: DATAFU-95 > URL: https://issues.apache.org/jira/browse/DATAFU-95 > Project: DataFu > Issue Type: Improvement >Reporter: Jakob Homan >Priority: Minor > > Right now if one tries to build against JDK1.7, the resulting failure is a > bit unclear: > {noformat}Download > https://repo1.maven.org/maven2/org/eclipse/equinox/app/1.3.200-v20130910-1609/app-1.3.200-v20130910-1609.jar > /Users/jahoman/repos/datafu/build-plugin/src/main/java/org/adrianwalker/multilinestring/MultilineProcessor.java:18: > error: cannot find symbol > @SupportedSourceVersion(SourceVersion.RELEASE_8) > ^ > symbol: variable RELEASE_8 > location: class SourceVersion > 1 error > :build-plugin:compileJava FAILED > FAILURE: Build failed with an exception. > {noformat} > It may be better to use something like [The > Sweeney|https://github.com/boxheed/gradle-sweeney-plugin] to enforce this and > provide a better, faster message on failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)