The configuration looks fine, but it is hard to say where the problem is without seeing exactly what ends up in your lib directory. If this packaging doesn't work for you, you can either pack the dependencies into the UDF jar as a single fat jar, or drop the dependency jars into the "asterix-server-0.9.*-binary-assembly/repo" directory so that they are distributed with the AsterixDB instance. I would recommend the latter, since you then don't have to redeploy the dependency jars every time a UDF changes. Both methods are described in the documentation of the UDF template repo [1]. :)
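To narrow down which jar is not visible, a small check like the one below could be called from initialize(), or run standalone against the same classpath. It is only a sketch: DependencyCheck and findMissing are hypothetical names and not part of AsterixDB, and the class names listed in main are examples that should be replaced with the class reported in the ClassNotFoundException.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which classes a class loader cannot resolve. */
final class DependencyCheck {

    /** Returns the names that the loader cannot load, in input order. */
    static List<String> findMissing(ClassLoader loader, String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate and load the class,
                // do not run its static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Example class names (assumptions); use the one named in cc.log.
        String[] candidates = {
            "org.apache.commons.io.FileUtils",
            "org.nd4j.linalg.factory.Nd4j",
            "org.deeplearning4j.models.word2vec.Word2Vec"
        };
        List<String> missing =
                findMissing(DependencyCheck.class.getClassLoader(), candidates);
        System.out.println(missing.isEmpty()
                ? "All classes are loadable"
                : "Not loadable: " + missing);
    }
}
```

If a class shows up as not loadable here, the jar that provides it is most likely not on the classpath the UDF sees, which would point to one of the two packaging options above.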
[1] https://github.com/idleft/asterix-udf-template

Best,
Xikui

On Tue, Nov 27, 2018 at 6:04 AM [email protected] <[email protected]> wrote:

> Thank you for making sense of the log file for me, I managed to get the
> parameters work!
>
> However, a new challenge became evident, of course. The new error that I
> am seeing (java.lang.ClassNotFoundException in the cc.log when trying to
> use one of the dependencies in my code). I think this may be happening due
> to the external dependency, and if it is reachable or not from my UDF when
> running locally on AsterixDB. Could you explain if my approach for
> including external dependencies are right or not (approach/steps listed
> below)?
>
> 1. The binary-assembly-libzip.xml looks like this, where the dependencies
> are included at the bottom:
>
> <assembly>
>   <id>testlib</id>
>   <formats>
>     <format>zip</format>
>   </formats>
>   <includeBaseDirectory>false</includeBaseDirectory>
>   <fileSets>
>     <fileSet>
>       <directory>target</directory>
>       <outputDirectory/>
>       <includes>
>         <include>*.jar</include>
>       </includes>
>     </fileSet>
>     <fileSet>
>       <directory>src/main/resources</directory>
>       <outputDirectory/>
>       <includes>
>         <include>library_descriptor.xml</include>
>       </includes>
>     </fileSet>
>   </fileSets>
>   <dependencySets>
>     <dependencySet>
>       <includes>
>         <include>commons-io:commons-io</include>
>         <include>ch.qos.logback:logback-core</include>
>         <include>org.slf4j:slf4j-api</include>
>         <include>ch.qos.logback:logback-classic</include>
>         <include>org.deeplearning4j:deeplearning4j-core</include>
>         <include>org.deeplearning4j:deeplearning4j-modelimport</include>
>         <include>org.deeplearning4j:deeplearning4j-nlp</include>
>         <include>org.nd4j:nd4j-api</include>
>         <include>org.nd4j:nd4j-native</include>
>       </includes>
>       <unpack>false</unpack>
>       <outputDirectory>lib</outputDirectory>
>     </dependencySet>
>   </dependencySets>
> </assembly>
>
> 2. When the Maven project is built (mvn clean install), it generates files
> in /target:
> - asterix-udf-template-0.1-SNAPSHOT-testlib.zip
> - asterix-udf-template-0.1-SNAPSHOT.jar
> - archive-tmp
> - classes
> - generated-sources
> - maven-archiver
> - maven-status
>
> 3. When unzipping the uppermost file (testlib), it contains:
> - lib (dictionary containing .jars for my dependencies listed above in
> binary-assembly-libzip.xml)
> - library_descriptor.xml
> - asterix-udf-template-0.1-SNAPSHOT.jar
>
> 4. And when unzipping the bottommost .jar inside the testlib here, it
> contains:
> - my model (model.bin.gz)
> - library_descriptor.xml
> - META-INF
> - org.apache.asterix.external
> ----> contains my classes
>
> Does this look right?
>
> I appreciate your help!
>
> Best regards,
> Sandra
>
> On 2018/11/27 06:38:58, Xikui Wang <[email protected]> wrote:
> > Hi Sandra,
> >
> > Based on the log, it seems you have an IndexOutOfBoundsException in your
> > UDF code. Can you double check your UDF at
> > org.apache.asterix.external.library.RelevanceDetecterFunction.initialize(RelevanceDetecterFunction.java:33)
> > and your UDF configuration file? You will have to make sure the parameters
> > are specified properly in the config file, and they are properly accessed
> > in the initialize method.
> >
> > Best,
> > Xikui
> >
> > On Mon, Nov 26, 2018 at 1:33 PM [email protected] <[email protected]> wrote:
> >
> > > Hi Xikui!
> > >
> > > So I tried to add the resource as a parameter. However, I get this error
> > > (gist with log from cc.log) [1] when the query below is executed:
> > >
> > > USE feeds;
> > > CONNECT FEED TestSocketFeed TO DATASET RelevantDataset
> > > APPLY function testlib#detectRelevance;
> > > start feed TestSocketFeed
> > >
> > > To provide some context, this query works as it should when I don't
> > > include the model.
> > >
> > > [1] https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb
> > >
> > > Best regards,
> > > Sandra
> > >
> > > On 2018/11/26 05:45:03, Xikui Wang <[email protected]> wrote:
> > > > Hi Sandra,
> > > >
> > > > Here is an example for adding parameters to a UDF [1]. As you can see, the
> > > > function "KeywordsDetectorFactory" reads a given list path from a UDF
> > > > parameter. You can use this to reuse a Java function with different
> > > > resource files. This function is contained in the AsterixDB release as
> > > > well. Please make sure the path to the resource file is correct when you
> > > > use it. That's a tricky part that I always make mistakes.
> > > >
> > > > The initialize(), i.e. the model loading, is executed when the "start feed"
> > > > statement is executed. This doesn't require Tweets to come. Is that the
> > > > case you are referring to?
> > > >
> > > > As for your use case, here is an interesting thing that you can try. There
> > > > is a feature in the data feeds which is currently not in our documentation,
> > > > which is to allow you to filter out incoming data by query predicates. If
> > > > you want to filter out Tweets with the model file that you trained, you can
> > > > attach a Java UDF on your ingestion pipeline with the following query:
> > > >
> > > > use test;
> > > > create type InputRecordType as closed {
> > > >   id:int64,
> > > >   fname:string,
> > > >   lname:string,
> > > >   age:int64,
> > > >   dept:string
> > > > };
> > > > create dataset EmpDataset(InputRecordType) primary key id;
> > > > create feed UserFeed with {
> > > >   "adapter-name" : "socket_adapter",
> > > >   "sockets" : "127.0.0.1:10001",
> > > >   "address-type" : "IP",
> > > >   "type-name" : "InputRecordType",
> > > >   "format" : "delimited-text",
> > > >   "delimiter" : "|",
> > > >   "upsert-feed" : "true"
> > > > };
> > > > *connect feed UserFeed to dataset EmpDataset WHERE
> > > > testlib#wordDetector(fname) = TRUE;*
> > > > start feed UserFeed;
> > > >
> > > > The Java UDF used here is in [2]. This can help you filter out unwanted
> > > > incoming data on the pipeline. :)
> > > >
> > > > [1]
> > > > https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
> > > >
> > > > [2]
> > > > https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
> > > >
> > > > Best,
> > > > Xikui
> > > >
> > > > On Sun, Nov 25, 2018 at 1:05 PM [email protected] <[email protected]> wrote:
> > > >
> > > > > Hi Xikui,
> > > > >
> > > > > Thanks for your response!
> > > > > We managed to cope with the problem by using the compressed version of the
> > > > > model instead, but it is still 1.6 GB. However, the project is able to
> > > > > build now :-) Yes, this is being packed into the UDF jar at the moment. Do
> > > > > you have any examples that illustrates how to use the resource file path as
> > > > > a UDF parameter? That would be very helpful!
> > > > >
> > > > > In addition, I believe that the model loading – which is now being
> > > > > executed during initialize() – restrains the incoming tweets of being
> > > > > processed. This is evident because none of the streaming elements are
> > > > > stored in AsterixDB when the model loading is included in the code, whilst
> > > > > the elements are stored when I exclude the model loading from the code. Is
> > > > > it possible to make the model load, i.e making initialize() run, prior the
> > > > > arrival of the tweets at the socketfeed?
> > > > >
> > > > > Regarding our project, we are trying to detect tweets which are relevant
> > > > > for a given "user query", where the goal is crisis detection. So we are
> > > > > trying to filter out (i.e _not_ store or keep in the pipeline) tweets which
> > > > > do not contain the relevant location etc. The model I've talked about is
> > > > > being used for word embeddings (word2vec) :-)
> > > > >
> > > > > Best regards,
> > > > > Sandra Skarshaug
> > > > >
> > > > > On 2018/11/24 17:55:27, Xikui Wang <[email protected]> wrote:
> > > > > > Hi Sandra,
> > > > > >
> > > > > > How big is the model file that you are using? I guess you are trying to
> > > > > > pack this model file into the UDF jar? I personally haven't seen this error
> > > > > > before. It feels like a Maven building with big files issue. I found this
> > > > > > thread on StackOverflow which describes the similar situation. Could you
> > > > > > try the resolutions there?
> > > > > >
> > > > > > As a side note, if you need to use a big model file in UDF, I wouldn't
> > > > > > suggest you pack that into your UDF jar file. It's because this will
> > > > > > significantly slow down your UDF installation, and you will spend a lot of
> > > > > > time redeploying the resource file to the cluster if you only need to
> > > > > > update the UDF code. Alternatively, you could make the resource file path
> > > > > > as a UDF parameter, and let the UDF load that file when it initializes.
> > > > > > This could make the installation much faster and avoid deploying the
> > > > > > resource file multiple times, and the packing issue should be gone as well.
> > > > > > :)
> > > > > >
> > > > > > PS If it's ok, could you tell us which use case that you are working on? We
> > > > > > would like to know how our customers use AsterixDB in different scenarios,
> > > > > > so we can help them (you) better!
> > > > > >
> > > > > > Best,
> > > > > > Xikui
> > > > > >
> > > > > > On Sat, Nov 24, 2018 at 6:05 AM [email protected] <[email protected]> wrote:
> > > > > >
> > > > > > > Hi!
> > > > > > >
> > > > > > > My master thesis partner and I have added a model for word embeddings
> > > > > > > (word2vec) in our project which is quite large. This is supposed to be
> > > > > > > loaded in the initialize phase of the UDF and be used for evaluating the
> > > > > > > incoming records.
> > > > > > >
> > > > > > > However, when trying to build the Maven project before deploying it to
> > > > > > > AsterixDB, we get the error "Error assembling JAR, invalid entry size". Is
> > > > > > > this a problem anyone else have faced when for instance using machine
> > > > > > > learning models in AsterixDB?
> > > > > > >
> > > > > > > If so, we appreciate any help!
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Sandra
