Hi Tom,

Are you comfortable trying out a trunk version? If so, could you check whether this reproduces on trunk? It looks similar to an issue that was recently resolved.
--tim

On Sun, Oct 5, 2014 at 4:47 PM, Tom Hood <[email protected]> wrote:

> Hi,
>
> I'm new to blur and have been spending a little time today learning the
> 0.2.3 API. I'm having trouble dumping the terms of a blur index.
>
> Here's some code that uses Iface.terms that sort of works (see below), but
> has an issue depending on the size parameter passed to Iface.terms.
>
> It wasn't obvious to me how to detect the end-of-terms condition, so if
> there's a cleaner way, please let me know.
>
>     public static void DumpTerms(Iface blurClient, String tableName)
>         throws BlurException, TException
>     {
>         Schema schema = blurClient.schema(tableName);
>         for (Map<String,ColumnDefinition> familyDef : schema.getFamilies().values()) {
>             for (ColumnDefinition columnDef : familyDef.values()) {
>                 DumpTermsForColumn(blurClient, tableName, columnDef);
>             }
>         }
>     }
>
>     public static void DumpTermsForColumn(Iface blurClient,
>                                           String tableName,
>                                           ColumnDefinition columnDef)
>         throws BlurException, TException
>     {
>         String family = columnDef.getFamily();
>         String column = columnDef.getColumnName();
>         String type = columnDef.getFieldType();
>
>         System.out.println(columnDef);
>         if (   !type.equals(TextFieldTypeDefinition.NAME)
>             && !type.equals(StringFieldTypeDefinition.NAME)) {
>             System.out.println("  WARNING: terms unavailable for type " + type);
>             return;
>         }
>
>         String startTerm = "";
>         int termCount = 0;
>         final short termFetchSize = 20; // loop logic assumes this is at least 2
>         while (true) {
>             List<String> terms = blurClient.terms(tableName,
>                                                   family,
>                                                   column,
>                                                   startTerm,
>                                                   termFetchSize);
>             if (   terms.isEmpty()
>                 || (terms.size() == 1 && terms.get(0).equals(startTerm))) {
>                 return;
>             }
>             for (String term : terms) {
>                 if (term.equals(startTerm)) {
>                     // 1st term is startTerm on calls 2-N of blurClient.terms
>                     continue;
>                 }
>                 if (term.isEmpty()) {
>                     // empty string returned when termFetchSize > terms left
>                     return;
>                 }
>                 startTerm = term;
>                 long termFreq = blurClient.recordFrequency(tableName,
>                                                            family,
>                                                            column,
>                                                            term);
>                 System.out.println("  term " + ++termCount
>                                    + ": [" + term + "] freq=" + termFreq);
>             }
>         }
>     }
>
> ColumnDefinition(family:technology, columnName:author, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>   term 1: [andy] freq=1
>   term 2: [beck] freq=1
>   term 3: [dave] freq=1
>   term 4: [douglas] freq=1
>   term 5: [erik] freq=2
>   term 6: [gospodnetic] freq=1
>   term 7: [hatcher] freq=2
>   term 8: [hofstadter] freq=1
>   term 9: [howard] freq=1
>   term 10: [hunt] freq=1
>   term 11: [husted] freq=1
>   term 12: [kent] freq=1
>   term 13: [lewis] freq=1
>   term 14: [loughran] freq=1
>   term 15: [massol] freq=1
>   term 16: [otis] freq=1
>   term 17: [papert] freq=1
>   term 18: [seymour] freq=1
>   term 19: [ship] freq=1
>   term 20: [steve] freq=1
>   term 21: [ted] freq=1
>   term 22: [thomas] freq=1
>   term 23: [vincent] freq=1
> ColumnDefinition(family:technology, columnName:title, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>   term 1: [action] freq=3
>   term 2: [an] freq=1
>   term 3: [ant] freq=1
>   term 4: [bach] freq=1
>   term 5: [braid] freq=1
>   term 6: [development] freq=1
>   term 7: [escher] freq=1
>   term 8: [eternal] freq=1
>   term 9: [explained] freq=1
>   term 10: [extreme] freq=1
>   term 11: [g] freq=1
>   term 12: [golden] freq=1
>   term 13: [in] freq=3
>   term 14: [java] freq=1
>   term 15: [junit] freq=1
>   term 16: [lucene] freq=1
>   term 17: [mindstorms] freq=1
>   term 18: [pragmatic] freq=1
>   term 19: [programmer] freq=1
>   term 20: [programming] freq=1
>   term 21: [tapestry] freq=1
>   term 22: [the] freq=1
>   term 23: [u00f6del] freq=1
>   term 24: [with] freq=1
> ColumnDefinition(family:technology, columnName:pubmonth, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>   term 1: [197903] freq=1
>   term 2: [198001] freq=1
>   term 3: [199910] freq=2
>   term 4: [200208] freq=1
>   term 5: [200310] freq=1
>   term 6: [200403] freq=1
>   term 7: [200406] freq=1
> ColumnDefinition(family:technology, columnName:subject, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>   term 1: [agile] freq=2
>   term 2: [ant] freq=1
>   term 3: [apache] freq=1
>   term 4: [artificial] freq=1
>   term 5: [build] freq=1
>   term 6: [children] freq=1
>   term 7: [components] freq=1
>   term 8: [computers] freq=1
>   term 9: [developer] freq=1
>   term 10: [development] freq=2
>   term 11: [driven] freq=1
>   term 12: [education] freq=1
>   term 13: [extreme] freq=1
>   term 14: [ideas] freq=1
>   term 15: [intelligence] freq=1
>   term 16: [interface] freq=1
>   term 17: [jakarta] freq=1
>   term 18: [java] freq=1
>   term 19: [junit] freq=2
>   term 20: [logo] freq=1
>   term 21: [lucene] freq=1
>   term 22: [mathematics] freq=1
>   term 23: [methodology] freq=2
>   term 24: [mock] freq=1
>   term 25: [music] freq=1
>   term 26: [number] freq=1
>   term 27: [objects] freq=1
>   term 28: [powerful] freq=1
>   term 29: [pragmatic] freq=1
>   term 30: [programming] freq=1
>   term 31: [search] freq=1
>   term 32: [tapestry] freq=1
>   term 33: [test] freq=1
>   term 34: [testing] freq=1
>   term 35: [theory] freq=1
>   term 36: [tool] freq=1
>   term 37: [tools] freq=1
>   term 38: [unit] freq=1
>   term 39: [user] freq=1
> ColumnDefinition(family:technology, columnName:isbn, subColumnName:null, fieldLessIndexed:false, fieldType:string, properties:null, sortable:false)
>   term 1: [020161622X] freq=1
>   term 2: [0201616416] freq=1
>   term 3: [0465026567] freq=1
>   term 4: [0465046290] freq=1
>   term 5: [1930110588] freq=1
>   term 6: [1930110995] freq=1
>   term 7: [1932394117] freq=1
>   term 8: [tbd] freq=1
> ColumnDefinition(family:technology, columnName:url, subColumnName:null, fieldLessIndexed:false, fieldType:text, properties:null, sortable:false)
>   term 1: [0201616416] freq=1
>   term 2: [0465026567] freq=1
>   term 3: [antbook] freq=1
>   term 4: [detail] freq=2
>   term 5: [exec] freq=2
>   term 6: [http] freq=8
>   term 7: [index.shtml] freq=1
>   term 8: [lewisship] freq=1
>   term 9: [lucene] freq=1
>   term 10: [massol] freq=1
>   term 11: [obidos] freq=2
>   term 12: [ppbook] freq=1
>   term 13: [tg] freq=2
>   term 14: [www.amazon.com] freq=2
>   term 15: [www.manning.com] freq=4
>   term 16: [www.papert.org] freq=1
>   term 17: [www.pragmaticprogrammer.com] freq=1
> Exception in thread "main" BlurException(message:Call execution exception [[lia, technology, url, www.pragmaticprogrammer.com, 20]], stackTraceStr:java.lang.ArrayIndexOutOfBoundsException: 128
>     at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:104)
>     at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.nextLeaf(BlockTreeTermsReader.java:2467)
>     at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum$Frame.next(BlockTreeTermsReader.java:2459)
>     at org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader$SegmentTermsEnum.next(BlockTreeTermsReader.java:2139)
>     at org.apache.blur.index.ExitableReader$ExitableTermsEnum.next(ExitableReader.java:233)
>     at org.apache.blur.manager.IndexManager.terms(IndexManager.java:1031)
>     at org.apache.blur.manager.IndexManager$9.call(IndexManager.java:982)
>     at org.apache.blur.manager.IndexManager$9.call(IndexManager.java:976)
>     at org.apache.blur.utils.ForkJoin$2.call(ForkJoin.java:63)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at org.apache.blur.concurrent.ThreadWatcher$ThreadWatcherExecutorService$1.run(ThreadWatcher.java:127)
>     at org.apache.blur.concurrent.BlurThreadPoolExecutor$1.run(BlurThreadPoolExecutor.java:83)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:662)
> , errorType:UNKNOWN)
>     at org.apache.blur.thrift.generated.Blur$terms_result$terms_resultStandardScheme.read(Blur.java:26728)
>     at org.apache.blur.thrift.generated.Blur$terms_result$terms_resultStandardScheme.read(Blur.java:26696)
>     at org.apache.blur.thrift.generated.Blur$terms_result.read(Blur.java:26638)
>     at org.apache.blur.thirdparty.thrift_0_9_0.TServiceClient.receiveBase(TServiceClient.java:78)
>     at org.apache.blur.thrift.generated.Blur$Client.recv_terms(Blur.java:1212)
>     at org.apache.blur.thrift.generated.SafeClientGen.recv_terms(SafeClientGen.java:508)
>     at org.apache.blur.thrift.generated.Blur$Client.terms(Blur.java:1195)
>     at org.apache.blur.thrift.generated.SafeClientGen.terms(SafeClientGen.java:942)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:60)
>     at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler$1.call(BlurClient.java:56)
>     at org.apache.blur.thrift.AbstractCommand.call(AbstractCommand.java:62)
>     at org.apache.blur.thrift.BlurClientManager.execute(BlurClientManager.java:197)
>     at org.apache.blur.thrift.BlurClient$BlurClientInvocationHandler.invoke(BlurClient.java:56)
>     at com.sun.proxy.$Proxy0.terms(Unknown Source)
>     at hoodware.sandbox.blur.BlurIndexMain.DumpTermsForColumn(BlurIndexMain.java:88)
>     at hoodware.sandbox.blur.BlurIndexMain.DumpTerms(BlurIndexMain.java:64)
>     at hoodware.sandbox.blur.BlurIndexMain.main(BlurIndexMain.java:38)
>
> The code works if I change termFetchSize to 2 instead of 20.
>
> The command "blur terms lia technology.url" will get the same exception.
>
> The command "blur terms lia technology.url -s2" will not get the exception,
> but goes into an infinite loop after it outputs:
> "- |www.pragmaticprogrammer.com "
>
> Attached is the csv file that I loaded into an empty table. It's a
> reformatted version of the Lucene In Action book's sample data (taken from
> data directory in
> http://www.manning-source.com/books/hatcher2/LuceneInAction.zip)
>
> I created the table with the commands:
>
> hadoop fs -mkdir lia_input
> hadoop fs -copyFromLocal ~/projects/lucene/LuceneInAction.csv lia_input
> hadoop fs -mkdir tables
> blur create -t lia -c 2 -l tables/lia
>
> foreach family (health technology philosophy education)
>   blur definecolumn lia $family title text
>   blur definecolumn lia $family isbn string
>   blur definecolumn lia $family author text
>   # blur definecolumn lia $family pubmonth date -p dateFormat yyyyMM
>   blur definecolumn lia $family pubmonth text # must be text for Blur.Iface.terms
>   blur definecolumn lia $family subject text
>   blur definecolumn lia $family url text
> end
>
> blur csvloader -c localhost:40010 -A -a -t lia -i lia_input -s';' \
>   -d 'health title isbn author pubmonth subject url' \
>   -d 'technology title isbn author pubmonth subject url' \
>   -d 'philosophy title isbn author pubmonth subject url' \
>   -d 'education title isbn author pubmonth subject url'
>
> Please let me know if you have any ideas on what I'm doing wrong.
>
> Thanks,
> -- Tom
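
On the end-of-terms question in the quoted message, here is a minimal sketch of one way to structure the paging loop. It does not call Blur at all: TermSource and inMemory are hypothetical stand-ins, and the sketch assumes only what the quoted code observed, namely that a page starts at the start term (inclusive), comes back in sorted order, and may be padded with empty strings. It stops when a page is short or contributes no new term, so it should not loop forever on the last term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TermPager {

    /** Hypothetical stand-in for a paged term lookup; NOT part of the Blur API. */
    interface TermSource {
        // Assumed contract: up to 'size' sorted terms, starting at 'startWith' (inclusive).
        List<String> terms(String startWith, int size);
    }

    /** Collects every term by paging until a page is short or adds nothing new. */
    static List<String> collectAll(TermSource source, int fetchSize) {
        if (fetchSize < 2) {
            // With size 1 every page after the first would only repeat the start term.
            throw new IllegalArgumentException("fetchSize must be at least 2");
        }
        List<String> all = new ArrayList<>();
        String start = "";
        while (true) {
            List<String> page = source.terms(start, fetchSize);
            int added = 0;
            for (String term : page) {
                if (term.isEmpty()) {
                    continue; // assumed padding, as seen in the quoted code
                }
                if (!all.isEmpty() && term.compareTo(all.get(all.size() - 1)) <= 0) {
                    continue; // the inclusive start term repeated from the previous page
                }
                all.add(term);
                added++;
            }
            if (added == 0 || page.size() < fetchSize) {
                return all; // no progress or a short page: end of terms
            }
            start = all.get(all.size() - 1);
        }
    }

    /** In-memory source used only for this demonstration. */
    static TermSource inMemory(List<String> sorted) {
        return (startWith, size) -> {
            List<String> page = new ArrayList<>();
            for (String t : sorted) {
                if (t.compareTo(startWith) >= 0 && page.size() < size) {
                    page.add(t);
                }
            }
            return page;
        };
    }

    public static void main(String[] args) {
        TermSource source = inMemory(Arrays.asList("detail", "exec", "http", "obidos", "tg"));
        System.out.println(collectAll(source, 2));
    }
}
```

Checking for progress rather than for one specific sentinel value keeps the loop safe if the server's behavior at the end of the term list differs between versions. Whether trunk still pads with empty strings is not something this sketch assumes either way.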
