I have just 3 chars to contribute: WOW

Otis



On Thu, Sep 26, 2013 at 8:29 AM, Steve Rowe <sar...@gmail.com> wrote:
> Except for #1/#34 - internal links to beginning-of-page sections point one 
> page earlier than they should - and #8/#41 - missing Thai and Polish chars - 
> which I don't know how to fix, I'll try to address the other items on this 
> (um, very long) list of mostly minor stuff I found:
>
> 0. All examples in the exported PDF have an extra blank line at the top.  I 
> was able to eliminate these from this page 
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604227> 
> ("What is an analyzer?") by eliminating the newline between the initial {code 
> …} line and the first line of the examples.  This doesn't have any apparent 
> effect on the layout of the page on the wiki, but the PDF export of that page 
> no longer has the extra blank lines.  Any objections to switching all {code} 
> examples in the guide like this?
>
> 1. Pg 2: The section links from the TOC all take you to the previous page, 
> rather than to the top of the page where the section starts.  (Same behavior 
> on OS X Preview, and under Windows, on Firefox's built-in PDF viewer and on 
> Adobe Reader.)  This looks like a general problem - see e.g. #34.
>
> 2. Pg 68: Stray asterisks in the <analyzer> tags in the <fieldType> example 
> under "Analysis Phases", apparently to make the surrounded text bold (which 
> also didn't happen).
>
> 3. Pg 69: The solr.KeywordTokenizerFactory example is missing one quotation 
> mark from each of the left and right hand sides.
>
> 4. Pg 70: Under "solr.TokenizerFactory", there is a bogus "StandardTokenizer" 
> link in the sentence "There aren't any filters that use StandardTokenizer's 
> types" - the link is to the non-existent "StandardTokenizer" page on the Solr 
> wiki.  (It might be useful to systematically link stuff like this to the 
> corresponding Lucene or Solr javadocs, but this should probably be templated 
> or scripted, so that the version-specific links are handled properly.)
>
> 5. Pg 71: Under "Standard Tokenizer", the email addresses recognition claim 
> is false, and Internet domain name recognition isn't validation per se, e.g. 
> "google.supercomputername" will be tokenized as a single token along with 
> "google.com".  The "Out" example output needs fixup accordingly.  I see that 
> the "Classic Tokenizer" section on pg 72 has the verbatim email/domain text; 
> for ClassicTokenizer, the email claim is true, but it has the same issue with 
> internet domain names as StandardTokenizer.
>
> 6. Pg 74: The NGram Tokenizer example output should be ("bicy", "bicyc", 
> "icyc", "icycl", "cycl", "cycle", "ycle") instead of all of the 4grams before 
> the 5grams (I think this class's behavior was changed in 4.4 by LUCENE-5042).
>
> 7. Pg 75: The ICU tokenizer "rulefiles" argument is missing.
>
> 8. Pg 75: The ICU Tokenizer's "In" input and "Out" output are completely 
> missing the Thai text that's visible on the wiki.
>
> 9. Pg 75: Missing spaces in the Regular Expression Pattern Tokenizer's 
> "group" attribute description, at the boundaries between the first two 
> sentences: "token(s).The" and "tokens.Non-negative".
>
> 10. Pg 72, 76, 77, etc.: Many analysis components' factory class names should 
> be styled with a fixed-width font.
>
> 11. Pg 77: UAX29 URL Email Tokenizer recognizes not only .com Internet domain 
> names, but also domain names including any other valid top-level domain 
> (i.e., unlike StandardTokenizer and ClassicTokenizer, domain names are 
> validated against the white list drawn from the IANA Root Zone database 
> <http://www.internic.net/zones/root.zone> as of the last time "ant gen-tld" 
> was performed and the tokenizer was generated.)
>
> 12. Pg 77: UAX29 tokenizer: "file:://" should be "file://"
>
> 13. Pg 77: UAX29 tokenizer's <URL> and <EMAIL> type names are missing angle 
> brackets.
>
> 14. Pg 77: UAX29 tokenizer's maxTokenLength attribute name should be styled 
> with a fixed-width font.
>
> 15. Pg 78: In the example demonstrating how arguments can be given to 
> <filter> elements via attributes, there is a stray asterisk, apparently 
> intended to bold the surrounding text, which also didn't work: *min="2" 
> max="7"/>
>
> 16. Pg 79: The ASCII Folding Filter's "Out" output should have the accent 
> stripped from the "á" -> "a" and the ASCII character value adjusted -> (ASCII 
> character 97)
>
> 17. Pg 81: The Edge N-gram Filter's 4-6 gram size example "Out" should be 
> ("four", "scor", "score", "twen", "twent", "twenty") - some of these are 
> missing.
>
> 18. Pg 83: The ICU Normalizer 2 Filter example should include the "name" and 
> "mode" attributes in the <filter> element.
>
> 19. Pg 87: Stray asterisks in both of the N-Gram Filter examples: 
> *minGramSize="...
>
> 20. Pg 87: The N-Gram Filter 3-5 gram size example "Out" output should be 
> ("fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore") - rather 
> than ordering by gram size, output is now ordered first by position and then 
> by gram size.
>
> 21. Pg 88: Stray asterisk in the first occurrence only example of the Pattern 
> Replace Filter: *replace="first".
>
> 22. Pg 89: "encoder" argument to the Phonetic Filter has surrounding double 
> curly brackets instead of being styled with a fixed-width font.
>
> 23. Pg 90: It should be mentioned on Porter Stem Filter that it's *four times 
> faster* than the English Snowball stemmer - I benchmarked it at 
> <http://markmail.org/thread/d2c443z63z37rwf6>
>
> 24. Pg 90: The Position Filter Factory is deprecated and will be removed in 
> 5.0 - this should be mentioned.
>
> 25. Pg 90: The Position Filter Factory example has the wrong token position 
> on the second token - it should be 2 instead of 3.
>
> 26. Pg 90: The "testsyns.txt" file contents are missing from Remove 
> Duplicates Token Filter.
>
> 27. Pg 92: Shingle Filter is missing params "minShingleSize", 
> "outputUnigramsIfNoShingles", and "tokenSeparator".
>
> 28. Pg 93: Standard Filter: as of lucene match version 3.1, this filter is a 
> no-op.
>
> 29. Pg 94: Stop Filter: the "enablePositionIncrements" arg is no longer 
> supported as of Lucene/Solr 4.4 - this should be mentioned, and the example 
> showing its use should be removed.  All of the examples need to have their 
> positions adjusted accordingly.  Also, all language-specific examples later 
> in the guide should have this arg removed.
>
> 30. Pg 97: Word Delimiter Filter: "-hotspot" is crossed out - the leading 
> hyphen needs to be escaped or something.
>
> 31. Pg 97: WDF: Missing period+space in the "splitOnCaseChange" arg 
> description: "XL"Example 1
>
> 32. Pg 97: WDF: "though" -> "through" in "protected" arg description.
>
> 33. Pg 98: CharFilterFactories: weird wording in "Char Filters can add, 
> change, or remove characters without worrying about fault of Token offsets." 
> - better: "Char Filters can add, change, or remove characters while 
> preserving original character offsets to support e.g. highlighting."
>
> 34. Pg 99&100: Under solr.HTMLStripCharFilterFactory, the links labeled 
> "Major Changes from Solr 3 to Solr 4." go one page previous to the start of 
> this section in the guide.
>
> 35. Pg 100: solr.HTMLStripCharFilterFactory: this is incorrect: "Inline tags, 
> such as <b>, <i>, or <span> will be replaced by a space."  It should be: 
> "Inline tags, such as <b>, <i>, or <span> will be removed - no space or 
> newline will be substituted."
>
> 36. Pg 100: solr.PatternReplaceCharFilterFactory: All of the "replaceWith" 
> column contents are missing backslashes; some have commas that shouldn't be 
> there; and some have curly brackets that shouldn't be there.
>
> 37. Pg 101: Dictionary Compound Word Token Filter: the content of 
> "germanwords.txt" ("dummkopfdonaudampfschiff") is missing spaces or newlines 
> between words - it should be "dumm kopf donau dampf schiff" instead.
>
> 38. Pg 102: Under "Unicode Collation", s/that also be used/that also *can* be 
> used/ in "Unicode Collation is a language-sensitive method of sorting text 
> that also be used for advanced search purposes."
>
> 39. Pg 102&103: Under "Sorting Text for a Specific Language", in the sentence 
> "You can see a list of supported Locales _here_", the link is to a list of 
> supported locales under Java 5.  The equivalent Java 6 link is 
> <http://www.oracle.com/technetwork/java/javase/locales-137662.html>.  
> Similarly, the Collator javadocs link in the sentence "For more information, 
> see the _Collator javadocs_" is to the Java 5 javadocs - the 
> equivalent Java 6 link is 
> <http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html>.  
> Similarly, under "Sorting Text with Custom Rules", the RuleBasedCollator 
> javadocs link in the sentence "For more information, see the 
> _RuleBasedCollator javadocs_" is to the Java 5 javadocs - the equivalent Java 
> 6 link is 
> <http://docs.oracle.com/javase/6/docs/api/java/text/RuleBasedCollator.html>.
>
> 40. Pg 102-105: Under Unicode Collation: (ICU)CollationFilterFactory have 
> been deprecated (and will be removed in 5.0) in favor of (ICU)CollationField, 
> which will need descriptions and examples.
>
> 41. Pg 105: Under Collation Key Filter, several city names in the result 
> example are missing characters with diacritics: "Białystok" is missing its 
> "ł", "Łowicz" is missing its "Ł", and "Świdnik" is missing its "Ś".
>
> 42. Pg 106: ISO Latin Accent Filter: this class is no longer present as of 
> Solr 4.0 - this section should be replaced with one about ASCIIFoldingFilter. 
>  Also, the solr.MappingCharFilterFactory section on Pg 99 should be changed 
> to use "mapping-FoldToASCII.txt" instead of "mapping-ISOLatin1Accent.txt".
>
> 43. Pg 106: Language-Specific Factories: Catalan, Danish, Irish and Romanian 
> are missing from the covered languages; Catalan and Irish should include 
> ElisionFilterFactory in their examples - there are articles lists in Lucene's 
> {Catalan,Irish}Analyzer.
>
> 44. Pg 107-120: Example analyzers for the following languages don't include a 
> <tokenizer> - they should include StandardTokenizer: Arabic, Bulgarian, 
> Czech, Galician, Hindi, Indonesian, Italian, Persian, Polish, Swedish, 
> Spanish, and Turkish.
>
> 45. Pg 109-112: The Dutch, Finnish and German examples all include a stray 
> trailing space in their <tokenizer> class names.
>
> 46. Pg 110: Elision Filter: used for other languages besides French (e.g. 
> Catalan, Italian, and Irish); ElisionFilter class was moved from the 
> o.a.l.analysis.fr package to o.a.l.analysis.util.
>
> 47. Pg 110: Elision Filter: "articles" arg is not required (defaults to 
> FrenchAnalyzer.DEFAULT_ARTICLES)
>
> 48. Pg 110: Elision Filter: "ignoreCase" arg is missing.
>
> 49. Pg 113: Italian: an example using ElisionFilterFactory should be included 
> - there is an articles list in Lucene's ItalianAnalyzer.
>
> 50. Pg 113: Kuromoji: ", as in the following example:" should be removed from 
> the following sentence, since there is no following example: "You can also 
> make discarding punctuation configurable in the JapaneseTokenizerFactory, by 
> setting discardPunctuation to false (to show punctuation) or true (to discard 
> punctuation), as in the following example:"
>
> 51. Pg 114: Lao, Myanmar, Khmer: these are no longer in analysis-extras.  
> There should either be an example for these here, or a pointer to another 
> ICUTokenizerFactory example elsewhere in the guide.
>
> 52. Pg 114-116: Norwegian: the Snowball stemmer isn't mentioned in the 
> supported Norwegian stemmers list, but the two examples erroneously include 
> the Snowball stemmer *along with another stemmer*!
>
> 53. Pg 117: Russian: Russian Letter Tokenizer is deprecated, and it no longer 
> supports the "charset" arg.
>
> 54. Pg 117: Russian: Russian Lower Case Filter was removed in 4.0.  It should 
> be replaced by LowerCaseFilter in all examples.
>
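
For items 6, 17 and 20 above, here is a rough Python sketch of the ordering being described: grams are emitted by start position first, then by gram size, and edge n-grams only grow from the front of each token. This is not Lucene's code. The function names are made up for illustration, and the input strings ("bicycle", "four score", "four score and twenty") are assumptions inferred from the expected outputs in the list.

```python
def ngrams(token, min_size, max_size):
    """Return n-grams ordered by start position, then by gram size."""
    grams = []
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


def edge_ngrams(token, min_size, max_size):
    """Return n-grams anchored at the front of the token.

    Tokens shorter than min_size produce nothing in this sketch.
    """
    return [token[:size] for size in range(min_size, min(max_size, len(token)) + 1)]


# Item 6: 4-5 grams over a single token (input assumed to be "bicycle").
print(ngrams("bicycle", 4, 5))

# Item 20: 3-5 grams applied per whitespace-separated token (input assumed).
print([g for tok in "four score".split() for g in ngrams(tok, 3, 5)])

# Item 17: 4-6 edge grams applied per token (input assumed).
print([g for tok in "four score and twenty".split() for g in edge_ngrams(tok, 4, 6)])
```

With those assumed inputs the three lines print the same sequences as listed in items 6, 20 and 17 respectively. Swapping the two loops in ngrams() gives the older "all 4grams before the 5grams" ordering.
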
> Steve
>
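
On item 16, a small Python sketch of what "accent stripped" means for the example character. It only approximates the idea by decomposing the character and dropping combining marks; ASCIIFoldingFilter uses its own mapping table, so treat fold_accents() as a hypothetical stand-in, not the filter's behavior in general.

```python
import unicodedata


def fold_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = fold_accents("á")
print(folded, ord(folded))  # a 97
```
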
> On Sep 25, 2013, at 3:36 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
>>
>> Please vote to release the following artifacts as the Apache Solr Reference 
>> Guide for 4.5...
>>
>> https://dist.apache.org/repos/dist/dev/lucene/solr/ref-guide/apache-solr-ref-guide-4.5-RC0/
>>
>> $ cat apache-solr-ref-guide-4.5-RC0/apache-solr-ref-guide-4.5.pdf.sha1
>> ee40215d30f264d663f723ea2196b72b8cc5effc  apache-solr-ref-guide-4.5.pdf
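
For anyone checking the RC, a minimal Python sketch of comparing a downloaded artifact against its .sha1 file. It assumes the .sha1 file has the layout shown above (hex digest first, then the file name); the helper names are made up for illustration.

```python
import hashlib


def sha1_of_file(path, chunk_size=65536):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha1_file(artifact_path, sha1_path):
    """Compare the artifact's digest with the first field of the .sha1 file."""
    with open(sha1_path, encoding="ascii") as fh:
        expected = fh.read().split()[0].lower()
    return sha1_of_file(artifact_path) == expected


# Example call, using the file names from the vote mail:
# matches_sha1_file("apache-solr-ref-guide-4.5.pdf",
#                   "apache-solr-ref-guide-4.5.pdf.sha1")
```
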
>>
>> (When reviewing the PDF, please don't hesitate to point out any typos or 
>> formatting glitches or any other problems of subject matter. Re-spinning a 
>> new RC is trivial, so in my opinion the bar is very low in terms of what 
>> things are worth fixing before release.)
>>
>>
>>
>>
>>
>> -Hoss
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>