Hi all, I'd like to announce the release of Apache DataFu (incubating) 1.3.0. This is the first release since entering the Apache incubator. Thanks to all who contributed!
Apache DataFu is a collection of libraries for working with large-scale data in Hadoop. The project was inspired by the need for stable, well-tested libraries for data mining and statistics. It consists of two libraries: Apache DataFu Pig, a collection of user-defined functions for Apache Pig, and Apache DataFu Hourglass, an incremental processing framework for Apache Hadoop in MapReduce. You can obtain the source release from: http://www.apache.org/dyn/closer.cgi/incubator/datafu/apache-datafu-incubating-1.3.0/ Please follow the README for instructions on building. A summary of changes for 1.3.0 appears below. Additions: * New UDFs for entropy and weighted sampling algorithms (DATAFU-2, DATAFU-26) * Updated SimpleRandomSample to be consistent with SimpleRandomSampleWithReplacement (DATAFU-5) * Created OpenNLP UDF wrappers (DATAFU-8) * Created RandomUUID UDF (DATAFU-18) * Added LSH implementation (DATAFU-37) * Added Base64Encode/Decode (DATAFU-52) * URLInfo UDF (DATAFU-62) * Created SelectFieldByName UDF (DATAFU-69) * Added generic BagJoin that supports inner, left, and full outer joins (DATAFU-70) * Added ZipBags UDF which can zip and arbitrary number of bags into one (DATAFU-79) * Hadoop 2.0 compatibility (DATAFU-58) * Created TupleFromBag.java file (DATAFU-92) Improvements: * Simplified BagGroup output (DATAFU-42) Changes: * StagedOutputJob no longer writes counters by default (DATAFU-35) Fixes: * ReservoirSample does not behave as expected when grouping by a key other than ALL (DATAFU-11) * DistinctBy does not work correctly on strings containing minuses (DATAFU-31) * Hourglass does not honor "fail on missing" in all cases (DATAFU-35) * Hash UDFs return zero-padded strings of uniform length even when leading bits are zero (DATAFU 46) * UDF examples work again (DATAFU-49) * SampleByKey can throw NullPointerException (DATAFU-68) Build system: * Removed legacy checked in jars (DATAFU-55) * Updated to use Pig 0.12.1 (DATAFU-10) * Switched from Ant to Gradle 1.12 (DATAFU-27, DATAFU-44, DATAFU-43, DATAFU-66) * Removed checked in jars, download where necessary (DATAFU-55) * Fixed test.sh to use gradlew (DATAFU-77) Release related: * NOTICE updated with dependencies used or shipped with DataFu. * Apache license headers added to all necessary files (DATAFU-4, DATAFU-75) * Added doap file (DATAFU-36) * Source tarball generation, gradle bootstrapping, and release instructions (DATAFU-57, DATAFU-78, DATAFU-72) * Removed author tags (DATAFU-74) * Resolved issues with build-plugin directory (DATAFU-76) * Used Apache RAT to verify correct file headers (DATAFU-73, DATAFU-84) Documentation related: * New website (DATAFU-20, etc.) * StreamingQuantile PDF link is broken (DATAFU-29) * README file updated Cheers, Matt