Hi Xikui,

So when deploying my UDF to AsterixDB, I've put the content of the unzipped testlib folder into this folder:
apache-asterixdb-0.9.5-SNAPSHOT/lib/udfs/feeds/testlib/

The resulting testlib content then looks like this:
- library_descriptor.xml
- asterix-udf-template-0.1-SNAPSHOT.jar
- lib (folder with external dependencies)

However, since the dependencies from this /lib folder ought to be copied into apache-asterixdb-0.9.5-SNAPSHOT/repo instead, should I delete the apache-asterixdb-0.9.5-SNAPSHOT/lib/udfs/feeds/testlib/lib folder, which is created when dropping the unzipped UDF package inside testlib, or keep the dependencies both there and in /repo?

Thanks!

On 2018/11/28 17:03:31, Xikui Wang <[email protected]> wrote:
> Hi Sandra,
>
> If you are following the binary-assembly-libzip.xml that you showed to me earlier, the specified dependency jars should be under the lib directory in your compiled UDF package, i.e., "- lib (directory containing .jars for my dependencies listed above in binary-assembly-libzip.xml)". You can copy all the jar files in this directory to the repo directory in AsterixDB. That would work. As for the repacking part, that was for those who want to distribute their patched AsterixDB to their users. In your case, you can ignore that.
>
> Best,
> Xikui
>
> On Wed, Nov 28, 2018 at 1:51 AM [email protected] <[email protected]> wrote:
>
> > Hi, thanks again Xikui!
> >
> > I am trying the latter option now: dropping the dependency jars into the /repo folder. Does it matter where I copy the dependency jars from?
> >
> > In addition, I think I should provide some context about my locally run instance of AsterixDB:
> > - I have cloned the asterixdb repo from GitHub, so I have it locally on my MacBook Pro.
> > - Inside the cloned folder, in asterixdb/asterixdb/asterix-server/target/asterix-server-0.9.5-SNAPSHOT-binary-assembly, there is a folder called apache-asterixdb-0.9.5-SNAPSHOT, which in turn contains the folders bin, etc, lib, opt and repo.
> > - It is inside _this_ repo folder I am putting the dependency jars.
> > - It is from this /opt/local/bin folder I am running sh start-sample-cluster.sh
> >
> > So, when following the option 2 example provided in your link [1], it says to repack this folder into a zip again. I don't quite get this, as this is the folder I am using to run AsterixDB?
> >
> > Thanks in advance!
> >
> > Best regards,
> > Sandra
> >
> >
> > On 2018/11/27 16:38:23, Xikui Wang <[email protected]> wrote:
> > > The configuration seems alright, but it's very hard to say where the problem is since I haven't had the chance to see what exactly is in your lib directory. If this packaging doesn't work for you, you can try to pack the dependencies into the UDF jar as a single fat jar, or you can drop the dependency jars into the "asterix-server-0.9.*-binary-assembly/repo" directory, so they can be distributed with the AsterixDB instance. I would recommend the latter method, as you don't have to redeploy the dependency jars every time a UDF changes. These two methods are described in the documentation of the UDF template repo [1]. :)
> > >
> > > [1] https://github.com/idleft/asterix-udf-template
> > >
> > > Best,
> > > Xikui
> > >
> > > On Tue, Nov 27, 2018 at 6:04 AM [email protected] <[email protected]> wrote:
> > >
> > > > Thank you for making sense of the log file for me, I managed to get the parameters to work!
> > > >
> > > > However, a new challenge became evident, of course. I am now seeing a new error: java.lang.ClassNotFoundException in the cc.log when trying to use one of the dependencies in my code. I think this may be related to the external dependency and whether it is reachable from my UDF when running locally on AsterixDB. Could you explain whether my approach for including external dependencies is right or not (approach/steps listed below)?
> > > >
> > > > 1. The binary-assembly-libzip.xml looks like this, where the dependencies are included at the bottom:
> > > >
> > > > <assembly>
> > > >     <id>testlib</id>
> > > >     <formats>
> > > >         <format>zip</format>
> > > >     </formats>
> > > >     <includeBaseDirectory>false</includeBaseDirectory>
> > > >     <fileSets>
> > > >         <fileSet>
> > > >             <directory>target</directory>
> > > >             <outputDirectory/>
> > > >             <includes>
> > > >                 <include>*.jar</include>
> > > >             </includes>
> > > >         </fileSet>
> > > >         <fileSet>
> > > >             <directory>src/main/resources</directory>
> > > >             <outputDirectory/>
> > > >             <includes>
> > > >                 <include>library_descriptor.xml</include>
> > > >             </includes>
> > > >         </fileSet>
> > > >     </fileSets>
> > > >     <dependencySets>
> > > >         <dependencySet>
> > > >             <includes>
> > > >                 <include>commons-io:commons-io</include>
> > > >                 <include>ch.qos.logback:logback-core</include>
> > > >                 <include>org.slf4j:slf4j-api</include>
> > > >                 <include>ch.qos.logback:logback-classic</include>
> > > >                 <include>org.deeplearning4j:deeplearning4j-core</include>
> > > >                 <include>org.deeplearning4j:deeplearning4j-modelimport</include>
> > > >                 <include>org.deeplearning4j:deeplearning4j-nlp</include>
> > > >                 <include>org.nd4j:nd4j-api</include>
> > > >                 <include>org.nd4j:nd4j-native</include>
> > > >             </includes>
> > > >             <unpack>false</unpack>
> > > >             <outputDirectory>lib</outputDirectory>
> > > >         </dependencySet>
> > > >     </dependencySets>
> > > > </assembly>
> > > >
> > > > 2. When the Maven project is built (mvn clean install), it generates these files in /target:
> > > > - asterix-udf-template-0.1-SNAPSHOT-testlib.zip
> > > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > > > - archive-tmp
> > > > - classes
> > > > - generated-sources
> > > > - maven-archiver
> > > > - maven-status
> > > >
> > > > 3. When unzipping the uppermost file (testlib), it contains:
> > > > - lib (directory containing .jars for my dependencies listed above in binary-assembly-libzip.xml)
> > > > - library_descriptor.xml
> > > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > > >
> > > > 4. And when unzipping the bottommost .jar inside the testlib here, it contains:
> > > > - my model (model.bin.gz)
> > > > - library_descriptor.xml
> > > > - META-INF
> > > > - org.apache.asterix.external
> > > >   ----> contains my classes
> > > >
> > > > Does this look right?
> > > >
> > > > I appreciate your help!
> > > >
> > > > Best regards,
> > > > Sandra
> > > >
> > > > On 2018/11/27 06:38:58, Xikui Wang <[email protected]> wrote:
> > > > > Hi Sandra,
> > > > >
> > > > > Based on the log, it seems you have an IndexOutOfBoundsException in your UDF code. Can you double check your UDF at
> > > > > org.apache.asterix.external.library.RelevanceDetecterFunction.initialize(RelevanceDetecterFunction.java:33)
> > > > > and your UDF configuration file? You will have to make sure the parameters are specified properly in the config file, and that they are properly accessed in the initialize method.
> > > > >
> > > > > Best,
> > > > > Xikui
> > > > >
> > > > > On Mon, Nov 26, 2018 at 1:33 PM [email protected] <[email protected]> wrote:
> > > > >
> > > > > > Hi Xikui!
> > > > > >
> > > > > > So I tried to add the resource as a parameter. However, I get this error (gist with log from cc.log) [1] when the query below is executed:
> > > > > >
> > > > > > USE feeds;
> > > > > > CONNECT FEED TestSocketFeed TO DATASET RelevantDataset
> > > > > > APPLY function testlib#detectRelevance;
> > > > > > start feed TestSocketFeed
> > > > > >
> > > > > > To provide some context, this query works as it should when I don't include the model.
> > > > > >
> > > > > > [1] https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb
> > > > > >
> > > > > > Best regards,
> > > > > > Sandra
> > > > > >
> > > > > >
> > > > > > On 2018/11/26 05:45:03, Xikui Wang <[email protected]> wrote:
> > > > > > > Hi Sandra,
> > > > > > >
> > > > > > > Here is an example for adding parameters to a UDF [1]. As you can see, the function "KeywordsDetectorFactory" reads a given list path from a UDF parameter. You can use this to reuse a Java function with different resource files. This function is contained in the AsterixDB release as well. Please make sure the path to the resource file is correct when you use it. That's a tricky part where I always make mistakes.
> > > > > > >
> > > > > > > The initialize(), i.e. the model loading, is executed when the "start feed" statement is executed. This doesn't require Tweets to come. Is that the case you are referring to?
> > > > > > >
> > > > > > > As for your use case, here is an interesting thing that you can try. There is a feature in the data feeds, currently not in our documentation, that allows you to filter out incoming data by query predicates.
> > > > > > > If you want to filter out Tweets with the model file that you trained, you can attach a Java UDF on your ingestion pipeline with the following query:
> > > > > > >
> > > > > > > use test;
> > > > > > > create type InputRecordType as closed {
> > > > > > >     id:int64,
> > > > > > >     fname:string,
> > > > > > >     lname:string,
> > > > > > >     age:int64,
> > > > > > >     dept:string
> > > > > > > };
> > > > > > > create dataset EmpDataset(InputRecordType) primary key id;
> > > > > > > create feed UserFeed with {
> > > > > > >     "adapter-name" : "socket_adapter",
> > > > > > >     "sockets" : "127.0.0.1:10001",
> > > > > > >     "address-type" : "IP",
> > > > > > >     "type-name" : "InputRecordType",
> > > > > > >     "format" : "delimited-text",
> > > > > > >     "delimiter" : "|",
> > > > > > >     "upsert-feed" : "true"
> > > > > > > };
> > > > > > > *connect feed UserFeed to dataset EmpDataset WHERE testlib#wordDetector(fname) = TRUE;*
> > > > > > > start feed UserFeed;
> > > > > > >
> > > > > > > The Java UDF used here is in [2]. This can help you filter out unwanted incoming data on the pipeline. :)
> > > > > > >
> > > > > > > [1] https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
> > > > > > >
> > > > > > > [2] https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
> > > > > > >
> > > > > > > Best,
> > > > > > > Xikui
> > > > > > >
> > > > > > > On Sun, Nov 25, 2018 at 1:05 PM [email protected] <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Xikui,
> > > > > > > >
> > > > > > > > Thanks for your response!
> > > > > > > > We managed to cope with the problem by using the compressed version of the model instead, but it is still 1.6 GB.
> > > > > > > > However, the project is able to build now :-) Yes, this is being packed into the UDF jar at the moment. Do you have any examples that illustrate how to use the resource file path as a UDF parameter? That would be very helpful!
> > > > > > > >
> > > > > > > > In addition, I believe that the model loading, which is now being executed during initialize(), prevents the incoming tweets from being processed. This is evident because none of the streaming elements are stored in AsterixDB when the model loading is included in the code, whilst the elements are stored when I exclude the model loading from the code. Is it possible to make the model load, i.e. make initialize() run, prior to the arrival of the tweets at the socket feed?
> > > > > > > >
> > > > > > > > Regarding our project, we are trying to detect tweets which are relevant for a given "user query", where the goal is crisis detection. So we are trying to filter out (i.e. _not_ store or keep in the pipeline) tweets which do not contain the relevant location etc. The model I've talked about is being used for word embeddings (word2vec) :-)
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Sandra Skarshaug
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2018/11/24 17:55:27, Xikui Wang <[email protected]> wrote:
> > > > > > > > > Hi Sandra,
> > > > > > > > >
> > > > > > > > > How big is the model file that you are using? I guess you are trying to pack this model file into the UDF jar? I personally haven't seen this error before.
> > > > > > > > > It feels like an issue with Maven building big files. I found this thread on StackOverflow which describes a similar situation. Could you try the resolutions there?
> > > > > > > > >
> > > > > > > > > As a side note, if you need to use a big model file in a UDF, I wouldn't suggest packing it into your UDF jar file. This will significantly slow down your UDF installation, and you will spend a lot of time redeploying the resource file to the cluster even if you only need to update the UDF code. Alternatively, you could make the resource file path a UDF parameter, and let the UDF load that file when it initializes. This could make the installation much faster and avoid deploying the resource file multiple times, and the packing issue should be gone as well. :)
> > > > > > > > >
> > > > > > > > > PS If it's ok, could you tell us which use case you are working on? We would like to know how our customers use AsterixDB in different scenarios, so we can help them (you) better!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Xikui
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Nov 24, 2018 at 6:05 AM [email protected] <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi!
> > > > > > > > > >
> > > > > > > > > > My master thesis partner and I have added a model for word embeddings (word2vec) in our project which is quite large.
> > > > > > > > > > This is supposed to be loaded in the initialize phase of the UDF and be used for evaluating the incoming records.
> > > > > > > > > >
> > > > > > > > > > However, when trying to build the Maven project before deploying it to AsterixDB, we get the error "Error assembling JAR, invalid entry size". Is this a problem anyone else has faced when, for instance, using machine learning models in AsterixDB?
> > > > > > > > > >
> > > > > > > > > > If so, we appreciate any help!
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Sandra
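
Note on the "copy all the jar files in this directory to the repo directory" step discussed at the top of this thread: the following is only a minimal Java sketch of that copy, not something taken from AsterixDB or the UDF template. The class name CopyUdfDependencies and the method copyJars are hypothetical, made up for illustration. Both directories are passed in as arguments, because the right paths depend on the local installation (in this thread they would be the unzipped testlib/lib folder and the apache-asterixdb-0.9.5-SNAPSHOT/repo folder). Jars that already exist in the target are overwritten, which is an assumption about the desired behaviour.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of AsterixDB.
public class CopyUdfDependencies {

    /**
     * Copies every *.jar found directly under libDir into repoDir and
     * returns the names of the copied files. Subdirectories are not visited.
     */
    static List<String> copyJars(Path libDir, Path repoDir) throws IOException {
        List<String> copied = new ArrayList<>();
        // Make sure the target directory exists before copying into it.
        Files.createDirectories(repoDir);
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                Path target = repoDir.resolve(jar.getFileName());
                // Assumption: an existing jar with the same name may be replaced.
                Files.copy(jar, target, StandardCopyOption.REPLACE_EXISTING);
                copied.add(jar.getFileName().toString());
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CopyUdfDependencies <udf-lib-dir> <asterixdb-repo-dir>");
            return;
        }
        List<String> copied = copyJars(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Copied " + copied.size() + " jar(s): " + copied);
    }
}
```

Only files ending in .jar are copied, so library_descriptor.xml and other files in the package are left alone. The instance most likely has to be restarted before newly copied jars are picked up, but that is an assumption and not something confirmed in this thread.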
