Hi Xikui!

I tried adding the resource as a parameter. However, I get this error (gist with the log from cc.log) [1] when the query below is executed:

USE feeds;
CONNECT FEED TestSocketFeed TO DATASET RelevantDataset APPLY function testlib#detectRelevance;
start feed TestSocketFeed

To provide some context, this query works as it should when I don't include the model.

[1] https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb

Best regards,
Sandra

On 2018/11/26 05:45:03, Xikui Wang <[email protected]> wrote:
> Hi Sandra,
>
> Here is an example for adding parameters to a UDF [1]. As you can see, the
> function "KeywordsDetectorFactory" reads a given list path from a UDF
> parameter. You can use this to reuse a Java function with different
> resource files. This function is contained in the AsterixDB release as
> well. Please make sure the path to the resource file is correct when you
> use it. That's a tricky part where I always make mistakes.
>
> The initialize(), i.e. the model loading, is executed when the "start feed"
> statement is executed. This doesn't require any Tweets to arrive. Is that
> the case you are referring to?
>
> As for your use case, here is an interesting thing that you can try. There
> is a feature in the data feeds, currently not in our documentation, that
> allows you to filter out incoming data by query predicates. If
> you want to filter out Tweets with the model file that you trained, you can
> attach a Java UDF on your ingestion pipeline with the following query:
>
> use test;
> create type InputRecordType as closed {
>     id:int64,
>     fname:string,
>     lname:string,
>     age:int64,
>     dept:string
> };
> create dataset EmpDataset(InputRecordType) primary key id;
> create feed UserFeed with {
>     "adapter-name" : "socket_adapter",
>     "sockets" : "127.0.0.1:10001",
>     "address-type" : "IP",
>     "type-name" : "InputRecordType",
>     "format" : "delimited-text",
>     "delimiter" : "|",
>     "upsert-feed" : "true"
> };
> *connect feed UserFeed to dataset EmpDataset WHERE
> testlib#wordDetector(fname) = TRUE;*
> start feed UserFeed;
>
> The Java UDF used here is in [2].
> This can help you filter out unwanted incoming data on the pipeline. :)
>
> [1]
> https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
>
> [2]
> https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
>
> Best,
> Xikui
>
> On Sun, Nov 25, 2018 at 1:05 PM [email protected] <[email protected]> wrote:
>
> > Hi Xikui,
> >
> > Thanks for your response!
> > We managed to cope with the problem by using the compressed version of the
> > model instead, but it is still 1.6 GB. However, the project is able to
> > build now :-) Yes, this is being packed into the UDF jar at the moment. Do
> > you have any examples that illustrate how to use the resource file path as
> > a UDF parameter? That would be very helpful!
> >
> > In addition, I believe that the model loading, which is now being
> > executed during initialize(), prevents the incoming tweets from being
> > processed. This is evident because none of the streaming elements are
> > stored in AsterixDB when the model loading is included in the code, whilst
> > the elements are stored when I exclude the model loading from the code. Is
> > it possible to make the model load, i.e. make initialize() run, prior to
> > the arrival of the tweets at the socket feed?
> >
> > Regarding our project, we are trying to detect tweets which are relevant
> > for a given "user query", where the goal is crisis detection. So we are
> > trying to filter out (i.e. _not_ store or keep in the pipeline) tweets which
> > do not contain the relevant location etc. The model I've talked about is
> > being used for word embeddings (word2vec) :-)
> >
> > Best regards,
> > Sandra Skarshaug
> >
> >
> > On 2018/11/24 17:55:27, Xikui Wang <[email protected]> wrote:
> > > Hi Sandra,
> > >
> > > How big is the model file that you are using? I guess you are trying to
> > > pack this model file into the UDF jar?
> > > I personally haven't seen this error
> > > before. It feels like an issue with Maven building big files. I found this
> > > thread on StackOverflow which describes a similar situation. Could you
> > > try the resolutions there?
> > >
> > > As a side note, if you need to use a big model file in a UDF, I wouldn't
> > > suggest you pack that into your UDF jar file. This will
> > > significantly slow down your UDF installation, and you will spend a lot of
> > > time redeploying the resource file to the cluster if you only need to
> > > update the UDF code. Alternatively, you could make the resource file path
> > > a UDF parameter, and let the UDF load that file when it initializes.
> > > This could make the installation much faster and avoid deploying the
> > > resource file multiple times, and the packing issue should be gone as well.
> > > :)
> > >
> > > PS If it's ok, could you tell us which use case you are working on? We
> > > would like to know how our customers use AsterixDB in different scenarios,
> > > so we can help them (you) better!
> > >
> > > Best,
> > > Xikui
> > >
> > >
> > >
> > > On Sat, Nov 24, 2018 at 6:05 AM [email protected] <[email protected]> wrote:
> > >
> > > > Hi!
> > > >
> > > > My master's thesis partner and I have added a model for word embeddings
> > > > (word2vec) to our project, and the model is quite large. It is supposed
> > > > to be loaded in the initialize phase of the UDF and be used for
> > > > evaluating the incoming records.
> > > >
> > > > However, when trying to build the Maven project before deploying it to
> > > > AsterixDB, we get the error "Error assembling JAR, invalid entry size". Is
> > > > this a problem anyone else has faced when, for instance, using machine
> > > > learning models in AsterixDB?
> > > >
> > > > If so, we appreciate any help!
> > > >
> > > > Best regards,
> > > > Sandra
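
The pattern suggested in the thread (pass the resource file path as a UDF parameter and load the file once when the function initializes, instead of packing it into the jar) can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions: the class ResourceBackedFilter and its initialize()/evaluate() signatures are hypothetical stand-ins and do not use the AsterixDB UDF interfaces, and a simple keyword list takes the place of the word2vec model. The example referenced in the thread is WordInListFunction [2].

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a UDF; NOT the AsterixDB interface.
class ResourceBackedFilter {
    private Set<String> keywords;

    // Runs once, before any record is evaluated.
    // Assumption: the first parameter is the path to the resource file.
    public void initialize(List<String> parameters) {
        if (parameters.isEmpty()) {
            throw new IllegalArgumentException("Expected the resource file path as first parameter");
        }
        Path path = Paths.get(parameters.get(0));
        try {
            keywords = new HashSet<>();
            for (String line : Files.readAllLines(path)) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    keywords.add(word);
                }
            }
        } catch (IOException e) {
            // Report the absolute path, since a wrong path is the easy mistake here.
            throw new UncheckedIOException("Cannot read resource file: " + path.toAbsolutePath(), e);
        }
    }

    // Runs per record; returns true if the text contains any loaded keyword.
    public boolean evaluate(String text) {
        if (keywords == null) {
            throw new IllegalStateException("initialize() was not called");
        }
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (keywords.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the externally deployed resource.
        Path list = Files.createTempFile("keywords", ".txt");
        Files.write(list, List.of("flood", "earthquake"));

        ResourceBackedFilter filter = new ResourceBackedFilter();
        filter.initialize(List.of(list.toString()));
        System.out.println(filter.evaluate("Flood warning issued for Oslo")); // prints true
        System.out.println(filter.evaluate("Nice weather today"));            // prints false
        Files.delete(list);
    }
}
```

Because the file is read only in initialize(), changing the UDF code does not require redeploying the resource, and a wrong path fails early with the absolute path in the message.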
