Re: Build UDF project (Maven) with large model to deploy to AsterixDB [Error assembling jar]

sandraskarshaug Wed, 28 Nov 2018 10:10:48 -0800

Hi Xikui,

So when deploying my UDF to AsterixDB, I've put the content of the unzipped 
testlib folder into this folder:
apache-asterixdb-0.9.5-SNAPSHOT/lib/udfs/feeds/testlib/


The resulting testlib content then looks like this:
- library_descriptor.xml
- asterix-udf-template-0.1-SNAPSHOT.jar
- lib (folder with external dependencies)

However, since the dependencies from this /lib folder ought to be copied into
apache-asterixdb-0.9.5-SNAPSHOT/repo instead, should I delete the
apache-asterixdb-0.9.5-SNAPSHOT/lib/udfs/feeds/testlib/lib folder, which is
created when dropping the unzipped UDF package inside testlib, or keep the
dependencies both there and in /repo?
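
The copy step described above could be sketched as a small shell helper. This
is only an illustration: the function name copy_udf_deps is hypothetical, and
the two paths in the example call are assumptions taken from the folder layout
in this message. It copies rather than moves, so the testlib/lib folder is left
untouched, and it skips any jar whose file name already exists in /repo so that
jars shipped with AsterixDB are not overwritten.

```shell
#!/bin/sh
# Hypothetical helper (not part of AsterixDB): copy every dependency jar
# from an unzipped UDF package's lib folder into the AsterixDB repo folder.
copy_udf_deps() {
    src_dir=$1    # folder holding the UDF's dependency jars
    dest_dir=$2   # AsterixDB repo folder

    if [ ! -d "$src_dir" ] || [ ! -d "$dest_dir" ]; then
        echo "source or destination folder is missing" >&2
        return 1
    fi

    for jar in "$src_dir"/*.jar; do
        # Skip the unexpanded pattern when the folder holds no jars.
        [ -e "$jar" ] || continue
        name=$(basename "$jar")
        # Leave jars that already exist in the destination untouched.
        if [ -e "$dest_dir/$name" ]; then
            echo "skipping $name (already present)"
            continue
        fi
        cp "$jar" "$dest_dir/$name"
    done
}

# Example call (paths are assumptions based on the layout above):
# copy_udf_deps \
#   apache-asterixdb-0.9.5-SNAPSHOT/lib/udfs/feeds/testlib/lib \
#   apache-asterixdb-0.9.5-SNAPSHOT/repo
```

The skip is by file name only, so two different versions of the same library
(with different file names) would still end up side by side in /repo.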

Thanks!


On 2018/11/28 17:03:31, Xikui Wang <[email protected]> wrote: 
> Hi Sandra,
> 
> If you are following the binary-assembly-libzip.xml that you showed to me
> earlier, the specified dependency jars should be under the lib directory in
> your compiled UDF package, i.e., "- lib (directory containing .jars for my
> dependencies listed above in binary-assembly-libzip.xml)". You can copy all
> the jar files in this directory to the repo directory in AsterixDB. That
> would work. As for the repacking part, that was for those who want to
> distribute their patched AsterixDB to their users. In your case, you can
> ignore that.
> 
> Best,
> Xikui
> 
> On Wed, Nov 28, 2018 at 1:51 AM [email protected] <
> [email protected]> wrote:
> 
> > Hi, thanks again Xikui!
> >
> > I am trying the latter option now, i.e. dropping the dependency jars into
> > the /repo folder. Does it matter where I copy the dependency jars from?
> >
> > In addition, I think I should provide some context of my locally run
> > instance of AsterixDB:
> > - I have cloned the asterixdb repo from github, so I have it local on my
> > Macbook Pro.
> > - Inside the cloned folder,
> > asterixdb/asterixdb/asterix-server/target/asterix-server-0.9.5-SNAPSHOT-binary-assembly
> > folder, there lies a folder called apache-asterixdb-0.9.5-SNAPSHOT, which
> > in turn contains the folders bin, etc, lib, opt and repo.
> > - It is inside _this_ repo folder I am putting the dependency jars.
> > - It is from this /opt/local/bin folder I am running sh
> > start-sample-cluster.sh
> >
> > So, when following the option 2 example provided in your link [1], it says
> > to repack this folder into a zip again. I don't quite get this, as this is
> > the folder I am using to run AsterixDB?
> >
> > Thanks in advance!
> >
> > Best regards,
> > Sandra
> >
> >
> > On 2018/11/27 16:38:23, Xikui Wang <[email protected]> wrote:
> > > The configuration seems alright, but it's very hard to say where the
> > > problem is since I haven't had the chance to see what exactly is in your
> > > lib directory. If this packaging doesn't work for you, you can try to
> > > pack the dependencies into the UDF jar as a single fat jar, or you can
> > > drop the dependency jars into the
> > > "asterix-server-0.9.*-binary-assembly/repo" directory, so they can be
> > > distributed with the AsterixDB instance. I would recommend the latter
> > > method, as you don't have to redeploy the dependency jars every time
> > > a UDF changes. These two methods are described in the documentation of
> > > the UDF template repo [1]. :)
> > >
> > > [1] https://github.com/idleft/asterix-udf-template
> > >
> > > Best,
> > > Xikui
> > >
> > > On Tue, Nov 27, 2018 at 6:04 AM [email protected] <
> > > [email protected]> wrote:
> > >
> > > > Thank you for making sense of the log file for me, I managed to get the
> > > > parameters to work!
> > > >
> > > > However, a new challenge became evident, of course. I am now seeing a
> > > > java.lang.ClassNotFoundException in the cc.log when trying to use one
> > > > of the dependencies in my code. I think this may be related to the
> > > > external dependency, and whether it is reachable from my UDF when
> > > > running locally on AsterixDB. Could you explain whether my approach for
> > > > including external dependencies is right or not (approach/steps listed
> > > > below)?
> > > >
> > > > 1. The binary-assembly-libzip.xml looks like this, where the
> > > > dependencies are included at the bottom:
> > > >
> > > > <assembly>
> > > >   <id>testlib</id>
> > > >   <formats>
> > > >     <format>zip</format>
> > > >   </formats>
> > > >   <includeBaseDirectory>false</includeBaseDirectory>
> > > >   <fileSets>
> > > >     <fileSet>
> > > >       <directory>target</directory>
> > > >       <outputDirectory/>
> > > >       <includes>
> > > >         <include>*.jar</include>
> > > >       </includes>
> > > >     </fileSet>
> > > >     <fileSet>
> > > >       <directory>src/main/resources</directory>
> > > >       <outputDirectory/>
> > > >       <includes>
> > > >         <include>library_descriptor.xml</include>
> > > >       </includes>
> > > >     </fileSet>
> > > >   </fileSets>
> > > >   <dependencySets>
> > > >     <dependencySet>
> > > >       <includes>
> > > >         <include>commons-io:commons-io</include>
> > > >         <include>ch.qos.logback:logback-core</include>
> > > >         <include>org.slf4j:slf4j-api</include>
> > > >         <include>ch.qos.logback:logback-classic</include>
> > > >         <include>org.deeplearning4j:deeplearning4j-core</include>
> > > >         <include>org.deeplearning4j:deeplearning4j-modelimport</include>
> > > >         <include>org.deeplearning4j:deeplearning4j-nlp</include>
> > > >         <include>org.nd4j:nd4j-api</include>
> > > >         <include>org.nd4j:nd4j-native</include>
> > > >       </includes>
> > > >       <unpack>false</unpack>
> > > >       <outputDirectory>lib</outputDirectory>
> > > >     </dependencySet>
> > > >   </dependencySets>
> > > > </assembly>
> > > >
> > > > 2. When the Maven project is built (mvn clean install), it generates
> > > > files in /target:
> > > > - asterix-udf-template-0.1-SNAPSHOT-testlib.zip
> > > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > > > - archive-tmp
> > > > - classes
> > > > - generated-sources
> > > > - maven-archiver
> > > > - maven-status
> > > >
> > > > 3. When unzipping the uppermost file (testlib), it contains:
> > > > - lib (directory containing .jars for my dependencies listed above in
> > > > binary-assembly-libzip.xml)
> > > > - library_descriptor.xml
> > > > - asterix-udf-template-0.1-SNAPSHOT.jar
> > > >
> > > > 4. And when unzipping the bottommost .jar inside the testlib here, it
> > > > contains:
> > > > - my model (model.bin.gz)
> > > > - library_descriptor.xml
> > > > - META-INF
> > > > - org.apache.asterix.external
> > > > ----> contains my classes
> > > >
> > > > Does this look right?
> > > >
> > > > I appreciate your help!
> > > >
> > > > Best regards,
> > > > Sandra
> > > >
> > > > On 2018/11/27 06:38:58, Xikui Wang <[email protected]> wrote:
> > > > > Hi Sandra,
> > > > >
> > > > > Based on the log, it seems you have an IndexOutOfBoundsException in
> > > > > your UDF code. Can you double check your UDF at
> > > > > org.apache.asterix.external.library.RelevanceDetecterFunction.initialize(RelevanceDetecterFunction.java:33)
> > > > > and your UDF configuration file? You will have to make sure the
> > > > > parameters are specified properly in the config file, and they are
> > > > > properly accessed in the initialize method.
> > > > >
> > > > > Best,
> > > > > Xikui
> > > > >
> > > > > On Mon, Nov 26, 2018 at 1:33 PM [email protected] <
> > > > > [email protected]> wrote:
> > > > >
> > > > > > Hi Xikui!
> > > > > >
> > > > > > So I tried to add the resource as a parameter. However, I get this
> > > > > > error (gist with log from cc.log) [1] when the query below is
> > > > > > executed:
> > > > > >
> > > > > > USE feeds;
> > > > > > CONNECT FEED TestSocketFeed TO DATASET RelevantDataset
> > > > > > APPLY function testlib#detectRelevance;
> > > > > > start feed TestSocketFeed
> > > > > >
> > > > > > To provide some context, this query works as it should when I don't
> > > > > > include the model.
> > > > > >
> > > > > > [1] https://gist.github.com/sandraskars/3f707d9b07e5b6c1006368a297b6eacb
> > > > > >
> > > > > > Best regards,
> > > > > > Sandra
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2018/11/26 05:45:03, Xikui Wang <[email protected]> wrote:
> > > > > > > Hi Sandra,
> > > > > > >
> > > > > > > Here is an example for adding parameters to a UDF [1]. As you can
> > > > > > > see, the function "KeywordsDetectorFactory" reads a given list
> > > > > > > path from a UDF parameter. You can use this to reuse a Java
> > > > > > > function with different resource files. This function is
> > > > > > > contained in the AsterixDB release as well. Please make sure the
> > > > > > > path to the resource file is correct when you use it. That's a
> > > > > > > tricky part where I always make mistakes.
> > > > > > >
> > > > > > > The initialize(), i.e. the model loading, is executed when the
> > > > > > > "start feed" statement is executed. This doesn't require Tweets
> > > > > > > to come. Is that the case you are referring to?
> > > > > > >
> > > > > > > As for your use case, here is an interesting thing that you can
> > > > > > > try. There is a feature in the data feeds, currently not in our
> > > > > > > documentation, that allows you to filter out incoming data by
> > > > > > > query predicates. If you want to filter out Tweets with the model
> > > > > > > file that you trained, you can attach a Java UDF on your
> > > > > > > ingestion pipeline with the following query:
> > > > > > >
> > > > > > > use test;
> > > > > > > create type InputRecordType as closed {
> > > > > > > id:int64,
> > > > > > > fname:string,
> > > > > > > lname:string,
> > > > > > > age:int64,
> > > > > > > dept:string
> > > > > > > };
> > > > > > > create dataset EmpDataset(InputRecordType) primary key id;
> > > > > > > create feed UserFeed with {
> > > > > > >     "adapter-name" : "socket_adapter",
> > > > > > >     "sockets" : "127.0.0.1:10001",
> > > > > > >     "address-type" : "IP",
> > > > > > >     "type-name" : "InputRecordType",
> > > > > > >     "format" : "delimited-text",
> > > > > > >     "delimiter" : "|",
> > > > > > >     "upsert-feed" : "true"
> > > > > > > };
> > > > > > > connect feed UserFeed to dataset EmpDataset WHERE
> > > > > > > testlib#wordDetector(fname) = TRUE;
> > > > > > > start feed UserFeed;
> > > > > > >
> > > > > > > The Java UDF used here is in [2]. This can help you filter out
> > > > unwanted
> > > > > > > incoming data on the pipeline. :)
> > > > > > >
> > > > > > > [1] https://github.com/idleft/asterix-udf-template/blob/master/src/main/resources/library_descriptor.xml
> > > > > > >
> > > > > > > [2] https://github.com/idleft/asterix-udf-template/blob/master/src/main/java/org/apache/asterix/external/library/WordInListFunction.java
> > > > > > >
> > > > > > > Best,
> > > > > > > Xikui
> > > > > > >
> > > > > > > On Sun, Nov 25, 2018 at 1:05 PM [email protected] <
> > > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Xikui,
> > > > > > > >
> > > > > > > > Thanks for your response!
> > > > > > > > We managed to cope with the problem by using the compressed
> > > > > > > > version of the model instead, but it is still 1.6 GB. However,
> > > > > > > > the project is able to build now :-) Yes, this is being packed
> > > > > > > > into the UDF jar at the moment. Do you have any examples that
> > > > > > > > illustrate how to use the resource file path as a UDF
> > > > > > > > parameter? That would be very helpful!
> > > > > > > >
> > > > > > > > In addition, I believe that the model loading, which is now
> > > > > > > > being executed during initialize(), prevents the incoming
> > > > > > > > tweets from being processed. This is evident because none of
> > > > > > > > the streaming elements are stored in AsterixDB when the model
> > > > > > > > loading is included in the code, whilst the elements are stored
> > > > > > > > when I exclude the model loading from the code. Is it possible
> > > > > > > > to make the model load, i.e. make initialize() run, prior to
> > > > > > > > the arrival of the tweets at the socket feed?
> > > > > > > >
> > > > > > > > Regarding our project, we are trying to detect tweets which are
> > > > > > > > relevant for a given "user query", where the goal is crisis
> > > > > > > > detection. So we are trying to filter out (i.e. _not_ store or
> > > > > > > > keep in the pipeline) tweets which do not contain the relevant
> > > > > > > > location etc. The model I've talked about is being used for
> > > > > > > > word embeddings (word2vec) :-)
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Sandra Skarshaug
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2018/11/24 17:55:27, Xikui Wang <[email protected]> wrote:
> > > > > > > > > Hi Sandra,
> > > > > > > > >
> > > > > > > > > How big is the model file that you are using? I guess you are
> > > > > > > > > trying to pack this model file into the UDF jar? I personally
> > > > > > > > > haven't seen this error before. It feels like an issue with
> > > > > > > > > Maven building with big files. I found this thread on
> > > > > > > > > StackOverflow which describes a similar situation. Could you
> > > > > > > > > try the resolutions there?
> > > > > > > > >
> > > > > > > > > As a side note, if you need to use a big model file in a UDF,
> > > > > > > > > I wouldn't suggest you pack that into your UDF jar file. This
> > > > > > > > > will significantly slow down your UDF installation, and you
> > > > > > > > > will spend a lot of time redeploying the resource file to the
> > > > > > > > > cluster if you only need to update the UDF code.
> > > > > > > > > Alternatively, you could make the resource file path a UDF
> > > > > > > > > parameter, and let the UDF load that file when it
> > > > > > > > > initializes. This could make the installation much faster and
> > > > > > > > > avoid deploying the resource file multiple times, and the
> > > > > > > > > packing issue should be gone as well. :)
> > > > > > > > >
> > > > > > > > > PS If it's ok, could you tell us which use case you are
> > > > > > > > > working on? We would like to know how our customers use
> > > > > > > > > AsterixDB in different scenarios, so we can help them (you)
> > > > > > > > > better!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Xikui
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Nov 24, 2018 at 6:05 AM [email protected] <
> > > > > > > > > [email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi!
> > > > > > > > > >
> > > > > > > > > > My master thesis partner and I have added a quite large
> > > > > > > > > > model for word embeddings (word2vec) to our project. This
> > > > > > > > > > is supposed to be loaded in the initialize phase of the UDF
> > > > > > > > > > and be used for evaluating the incoming records.
> > > > > > > > > >
> > > > > > > > > > However, when trying to build the Maven project before
> > > > > > > > > > deploying it to AsterixDB, we get the error "Error
> > > > > > > > > > assembling JAR, invalid entry size". Is this a problem
> > > > > > > > > > anyone else has faced when, for instance, using machine
> > > > > > > > > > learning models in AsterixDB?
> > > > > > > > > >
> > > > > > > > > > If so, we appreciate any help!
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Sandra
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
