Hi Tom,
Are you comfortable trying out a trunk build?  If so, could you check
whether this reproduces on trunk?  It looks similar to an issue that
was resolved recently.
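
On the end-of-terms question, here is a minimal sketch of one way to keep the paging logic separate from the client call. It is not taken from the Blur code base: TermPager, TermFetcher and collectAll are hypothetical names, and the sketch assumes only the behaviour described in the report below for 0.2.3 (each page after the first starts with the start term itself, and an empty string can appear when fewer terms remain than were requested). Treat those assumptions as unverified.

```java
import java.util.ArrayList;
import java.util.List;

final class TermPager {

    /** Hypothetical stand-in for a call such as Iface.terms(table, family, column, startWith, size). */
    interface TermFetcher {
        List<String> fetch(String startWith, short size) throws Exception;
    }

    /**
     * Collects all terms by repeatedly fetching pages.
     * Assumption (not verified against Blur): a page starts at the first
     * term greater than or equal to startWith, so pages overlap by one term.
     */
    static List<String> collectAll(TermFetcher fetcher, short pageSize) throws Exception {
        if (pageSize < 2) {
            // With a page size of 1 every page would contain only the start term.
            throw new IllegalArgumentException("pageSize must be at least 2");
        }
        List<String> all = new ArrayList<>();
        String last = null; // last term collected so far; null before the first page
        while (true) {
            List<String> page = fetcher.fetch(last == null ? "" : last, pageSize);
            boolean progressed = false;
            for (String term : page) {
                if (term.isEmpty()) {
                    // Empty entry reported on 0.2.3 when fewer terms remain than requested.
                    return all;
                }
                if (term.equals(last)) {
                    continue; // overlap with the previous page
                }
                all.add(term);
                last = term;
                progressed = true;
            }
            if (!progressed || page.size() < pageSize) {
                // Nothing new, or a short page: treat either as end of terms.
                return all;
            }
        }
    }
}
```

With a real client this could be called as TermPager.collectAll((start, size) -> blurClient.terms(tableName, family, column, start, size), (short) 20). The "no progress" check is what guards against looping forever if the server keeps returning only the start term.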

--tim


On Sun, Oct 5, 2014 at 4:47 PM, Tom Hood <[email protected]> wrote:
> Hi,
>
> I'm new to blur and have been spending a little time today learning the
> 0.2.3 API.  I'm having trouble dumping the terms of a blur index.
>
> Here's some code that uses Iface.terms and sort of works (see below), but
> it has an issue that depends on the size parameter passed to Iface.terms.
>
> It wasn't obvious to me how to detect the end-of-terms condition, so if
> there's a cleaner way, please let me know.
>
>     public static void DumpTerms(Iface blurClient, String tableName)
>         throws BlurException, TException
>     {
>         Schema schema = blurClient.schema(tableName);
>         for (Map<String,ColumnDefinition> familyDef :
>                  schema.getFamilies().values()) {
>             for (ColumnDefinition columnDef : familyDef.values()) {
>                 DumpTermsForColumn(blurClient, tableName, columnDef);
>             }
>         }
>     }
>
>     public static void DumpTermsForColumn(Iface            blurClient,
>                                           String           tableName,
>                                           ColumnDefinition columnDef)
>         throws BlurException, TException
>     {
>         String family = columnDef.getFamily();
>         String column = columnDef.getColumnName();
>         String type = columnDef.getFieldType();
>
>         System.out.println(columnDef);
>         if (   !type.equals(TextFieldTypeDefinition.NAME)
>             && !type.equals(StringFieldTypeDefinition.NAME)) {
>             System.out.println(" WARNING: terms unavailable for type "
>                                + type);
>             return;
>         }
>
>         String startTerm = "";
>         int termCount = 0;
>         // loop logic assumes this is at least 2
>         final short termFetchSize = 20;
>         while (true) {
>             List<String> terms = blurClient.terms(tableName,
>                                                   family,
>                                                   column,
>                                                   startTerm,
>                                                   termFetchSize);
>             if (   terms.isEmpty()
>                 || (terms.size() == 1 && terms.get(0).equals(startTerm))) {
>                 return;
>             }
>             for (String term : terms) {
>                 if (term.equals(startTerm)) {
>                     // 1st term is startTerm on calls 2-N of
>                     // blurClient.terms
>                     continue;
>                 }
>                 if (term.isEmpty()) {
>                     // empty string returned when termFetchSize > terms left
>                     return;
>                 }
>                 startTerm = term;
>                 long termFreq = blurClient.recordFrequency(tableName,
>                                                            family,
>                                                            column,
>                                                            term);
>                 System.out.println("    term " + ++termCount
>                                    + ": [" + term + "] freq=" + termFreq);
>             }
>         }
>     }
>
> ColumnDefinition(family:technology, columnName:author, subColumnName:null,
> fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>     term 1: [andy] freq=1
>     term 2: [beck] freq=1
>     term 3: [dave] freq=1
>     term 4: [douglas] freq=1
>     term 5: [erik] freq=2
>     term 6: [gospodnetic] freq=1
>     term 7: [hatcher] freq=2
>     term 8: [hofstadter] freq=1
>     term 9: [howard] freq=1
>     term 10: [hunt] freq=1
>     term 11: [husted] freq=1
>     term 12: [kent] freq=1
>     term 13: [lewis] freq=1
>     term 14: [loughran] freq=1
>     term 15: [massol] freq=1
>     term 16: [otis] freq=1
>     term 17: [papert] freq=1
>     term 18: [seymour] freq=1
>     term 19: [ship] freq=1
>     term 20: [steve] freq=1
>     term 21: [ted] freq=1
>     term 22: [thomas] freq=1
>     term 23: [vincent] freq=1
> ColumnDefinition(family:technology, columnName:title, subColumnName:null,
> fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>     term 1: [action] freq=3
>     term 2: [an] freq=1
>     term 3: [ant] freq=1
>     term 4: [bach] freq=1
>     term 5: [braid] freq=1
>     term 6: [development] freq=1
>     term 7: [escher] freq=1
>     term 8: [eternal] freq=1
>     term 9: [explained] freq=1
>     term 10: [extreme] freq=1
>     term 11: [g] freq=1
>     term 12: [golden] freq=1
>     term 13: [in] freq=3
>     term 14: [java] freq=1
>     term 15: [junit] freq=1
>     term 16: [lucene] freq=1
>     term 17: [mindstorms] freq=1
>     term 18: [pragmatic] freq=1
>     term 19: [programmer] freq=1
>     term 20: [programming] freq=1
>     term 21: [tapestry] freq=1
>     term 22: [the] freq=1
>     term 23: [u00f6del] freq=1
>     term 24: [with] freq=1
> ColumnDefinition(family:technology, columnName:pubmonth, subColumnName:null,
> fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>     term 1: [197903] freq=1
>     term 2: [198001] freq=1
>     term 3: [199910] freq=2
>     term 4: [200208] freq=1
>     term 5: [200310] freq=1
>     term 6: [200403] freq=1
>     term 7: [200406] freq=1
> ColumnDefinition(family:technology, columnName:subject, subColumnName:null,
> fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>     term 1: [agile] freq=2
>     term 2: [ant] freq=1
>     term 3: [apache] freq=1
>     term 4: [artificial] freq=1
>     term 5: [build] freq=1
>     term 6: [children] freq=1
>     term 7: [components] freq=1
>     term 8: [computers] freq=1
>     term 9: [developer] freq=1
>     term 10: [development] freq=2
>     term 11: [driven] freq=1
>     term 12: [education] freq=1
>     term 13: [extreme] freq=1
>     term 14: [ideas] freq=1
>     term 15: [intelligence] freq=1
>     term 16: [interface] freq=1
>     term 17: [jakarta] freq=1
>     term 18: [java] freq=1
>     term 19: [junit] freq=2
>     term 20: [logo] freq=1
>     term 21: [lucene] freq=1
>     term 22: [mathematics] freq=1
>     term 23: [methodology] freq=2
>     term 24: [mock] freq=1
>     term 25: [music] freq=1
>     term 26: [number] freq=1
>     term 27: [objects] freq=1
>     term 28: [powerful] freq=1
>     term 29: [pragmatic] freq=1
>     term 30: [programming] freq=1
>     term 31: [search] freq=1
>     term 32: [tapestry] freq=1
>     term 33: [test] freq=1
>     term 34: [testing] freq=1
>     term 35: [theory] freq=1
>     term 36: [tool] freq=1
>     term 37: [tools] freq=1
>     term 38: [unit] freq=1
>     term 39: [user] freq=1
> ColumnDefinition(family:technology, columnName:isbn, subColumnName:null,
> fieldLessIndexed:false, fieldType:string, properties:null, sortable:false)
>     term 1: [020161622X] freq=1
>     term 2: [0201616416] freq=1
>     term 3: [0465026567] freq=1
>     term 4: [0465046290] freq=1
>     term 5: [1930110588] freq=1
>     term 6: [1930110995] freq=1
>     term 7: [1932394117] freq=1
>     term 8: [tbd] freq=1
> ColumnDefinition(family:technology, columnName:url, subColumnName:null,
> fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>     term 1: [0201616416] freq=1
>     term 2: [0465026567] freq=1
>     term 3: [antbook] freq=1
>     term 4: [detail] freq=2
>     term 5: [exec] freq=2
>     term 6: [http] freq=8
>     term 7: [index.shtml] freq=1
>     term 8: [lewisship] freq=1
>     term 9: [lucene] freq=1
>     term 10: [massol] freq=1
>     term 11: [obidos] freq=2
>     term 12: [ppbook] freq=1
>     term 13: [tg] freq=2
>     term 14: [www.amazon.com] freq=2
>     term 15: [www.manning.com] freq=4
>     term 16: [www.papert.org] freq=1
>     term 17: [www.pragmaticprogrammer.com] freq=1
> Exception in thread "main" BlurException(message:Call execution exception
> [[lia, technology, url, www.pragmaticprogrammer.com, 20]],
> stackTraceStr:java.lang.ArrayIndexOutOfBoundsException: 128
> at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:104)
> at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextLeaf(BlockTreeTermsReader.java:2467)
> at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next(BlockTreeTermsReader.java:2459)
> at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next(BlockTreeTermsReader.java:2139)
> at org.apache.blur.index.ExitableReader$ExitableTermsEnum.next(ExitableReader.java:233)
> at org.apache.blur.manager.IndexManager.terms(IndexManager.java:1031)
> at org.apache.blur.manager.IndexManager$9.call(IndexManager.java:982)
> at org.apache.blur.manager.IndexManager$9.call(IndexManager.java:976)
> at org.apache.blur.utils.ForkJoin$2.call(ForkJoin.java:63)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at org.apache.blur.concurrent.ThreadWatcher$ThreadWatcherExecutorService$1.run(ThreadWatcher.java:127)
> at org.apache.blur.concurrent.BlurThreadPoolExecutor$1.run(BlurThreadPoolExecutor.java:83)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
> , errorType:UNKNOWN)
> at org.apache.blur.thrift.generated.Blur$terms_result$terms_resultStandardScheme.read(Blur.java:26728)
> at org.apache.blur.thrift.generated.Blur$terms_result$terms_resultStandardScheme.read(Blur.java:26696)
> at org.apache.blur.thrift.generated.Blur$terms_result.read(Blur.java:26638)
> at org.apache.blur.thirdparty.thrift_0_9_0.TServiceClient.receiveBase(TServiceClient.java:78)
> at org.apache.blur.thrift.generated.Blur$Client.recv_terms(Blur.java:1212)
> at org.apache.blur.thrift.generated.SafeClientGen.recv_terms(SafeClientGen.java:508)
> at org.apache.blur.thrift.generated.Blur$Client.terms(Blur.java:1195)
> at org.apache.blur.thrift.generated.SafeClientGen.terms(SafeClientGen.java:942)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:60)
> at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:56)
> at org.apache.blur.thrift.AbstractCommand.call(AbstractCommand.java:62)
> at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:197)
> at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler.invoke(BlurClient.java:56)
> at com.sun.proxy.$Proxy0.terms(Unknown Source)
> at hoodware.sandbox.blur.BlurIndexMain.DumpTermsForColumn(BlurIndexMain.java:88)
> at hoodware.sandbox.blur.BlurIndexMain.DumpTerms(BlurIndexMain.java:64)
> at hoodware.sandbox.blur.BlurIndexMain.main(BlurIndexMain.java:38)
>
> The code works if I change termFetchSize to 2 instead of 20.
>
> The command "blur terms lia technology.url" will get the same exception.
>
> The command "blur terms lia technology.url -s2" will not get the exception,
> but goes into an infinite loop after it outputs: "-
> |www.pragmaticprogrammer.com "
>
> Attached is the csv file that I loaded into an empty table.  It's a
> reformatted version of the Lucene In Action book's sample data (taken from
> the data directory in
> http://www.manning-source.com/books/hatcher2/LuceneInAction.zip).
>
> I created the table with the commands:
>
> hadoop fs -mkdir lia_input
> hadoop fs -copyFromLocal ~/projects/lucene/LuceneInAction.csv lia_input
> hadoop fs -mkdir tables
> blur create -t lia -c 2 -l tables/lia
>
> foreach family (health technology philosophy education)
>     blur definecolumn lia $family title text
>     blur definecolumn lia $family isbn string
>     blur definecolumn lia $family author text
> #    blur definecolumn lia $family pubmonth date -p dateFormat yyyyMM
>     # pubmonth must be text for Blur.Iface.terms
>     blur definecolumn lia $family pubmonth text
>     blur definecolumn lia $family subject text
>     blur definecolumn lia $family url text
> end
>
> blur csvloader -c localhost:40010 -A -a -t lia -i lia_input -s';' \
>     -d 'health title isbn author pubmonth subject url' \
>     -d 'technology title isbn author pubmonth subject url' \
>     -d 'philosophy title isbn author pubmonth subject url' \
>     -d 'education title isbn author pubmonth subject url'
>
> Please let me know if you have any ideas on what I'm doing wrong.
>
> Thanks,
> -- Tom
>
>
