I have just 3 chars to contribute: WOW

Otis

On Thu, Sep 26, 2013 at 8:29 AM, Steve Rowe <sar...@gmail.com> wrote:
> Except for #1/#34 - internal links to beginning-of-page sections point one
> page earlier than they should - and #8/#41 - missing Thai and Polish chars -
> which I don't know how to fix, I'll try to address the other items on this
> (um, very long) list of mostly minor stuff I found:
>
> 0. All examples in the exported PDF have an extra blank line at the top. I
> was able to eliminate these from this page
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604227>
> ("What is an analyzer?") by eliminating the newline between the initial {code
> …} line and the first line of the examples. This doesn't have any apparent
> effect on the layout of the page on the wiki, but the PDF export of that page
> no longer has the extra blank lines. Any objections to switching all {code}
> examples in the guide like this?
>
> 1. Pg 2: The section links from the TOC all take you to the previous page,
> rather than to the top of the page where the section starts. (Same behavior
> on OS X Preview, and under Windows, on Firefox's built-in PDF viewer and on
> Adobe Reader.) This looks like a general problem - see e.g. #34.
>
> 2. Pg 68: Stray asterisks in the <analyzer> tags in the <fieldType> example
> under "Analysis Phases", apparently to make the surrounded text bold (which
> also didn't happen).
>
> 3. Pg 69: The solr.KeywordTokenizerFactory example is missing one quotation
> mark from each of the left and right hand sides.
>
> 4. Pg 70: Under "solr.TokenizerFactory", there is a bogus "StandardTokenizer"
> link in the sentence "Theere aren't any filters that use StandardTokenizer's
> types" - the link is to the non-existent "StandardTokenizer" page on the Solr
> wiki. (It might be useful to systematically link stuff like this to the
> corresponding Lucene or Solr javadocs, but this should probably be templated
> or scripted, so that the version-specific links are handled properly.)
>
> 5. Pg 71: Under "Standard Tokenizer", the email addresses recognition claim
> is false, and Internet domain name recognition isn't validation per se, e.g.
> "google.supercomputername" will be tokenized as a single token along with
> "google.com". The "Out" example output needs fixup accordingly. I see that
> the "Classic Tokenizer" section on pg 72 has the verbatim email/domain text;
> for ClassicTokenizer, the email claim is true, but it has the same issue with
> internet domain names as StandardTokenizer.
>
> 6. Pg 74: The NGram Tokenizer example output should be ("bicy", "bicyc",
> "icyc", "icycl", "cycl", "cycle", "ycle") instead of all of the 4grams before
> the 5grams (I think this class's behavior was changed in 4.4 by LUCENE-5042).
>
> 7. Pg 75: The ICU tokenizer "rulefiles" argument is missing.
>
> 8. Pg 75: The ICU Tokenizer's "In" input and "Out" output are completely
> missing the Thai text that's visible on the wiki.
>
> 9. Pg 75: Missing spaces in the Regular Expression Pattern Tokenizer's
> "group" attribute description, at the boundaries between the first two
> sentences: "token(s).The" and "tokens.Non-negative".
>
> 10. Pg 72, 76, 77, etc.: Many analysis components' factory class names should
> be styled with a fixed-width font.
>
> 11. Pg 77: UAX29 URL Email Tokenizer recognizes not only .com Internet domain
> names, but also domain names including any other valid top-level domain
> (i.e., unlike StandardTokenizer and ClassicTokenizer, domain names are
> validated against the white list drawn from the IANA Root Zone database
> <http://www.internic.net/zones/root.zone> as of the last time "ant gen-tld"
> was performed and the tokenizer was generated.)
>
> 12. Pg 77: UAX29 tokenizer: "file:://" should be "file://"
>
> 13. Pg 77: UAX29 tokenizer's <URL> and <EMAIL> type names are missing angle
> brackets.
>
> 14. Pg 77: UAX29 tokenizer's maxTokenLength attribute name should be styled
> with a fixed-width font.
>
> 15. Pg 78: In the example demonstrating how arguments can be given to
> <filter> elements via attributes, there is a stray asterisk, apparently
> intended to bold the surrounding text, which also didn't work: *min="2"
> max="7"/>
>
> 16. Pg 79: The ASCII Folding Filter's "Out" output should have the accent
> stripped from the "á" -> "a" and the ASCII character value adjusted -> (ASCII
> character 97)
>
> 17. Pg 81: The Edge N-gram Filter's 4-6 gram size example "Out" should be
> ("four", "scor", "score", "twen", "twent", "twenty") - some of these are
> missing.
>
> 18. Pg 83: The ICU Normalizer 2 Filter example should include the "name" and
> "mode" attributes in the <filter> element.
>
> 19. Pg 87: Stray asterisks in both of the N-Gram Filter examples:
> *minGramSize="...
>
> 20. Pg 87: The N-Gram Filter 3-5 gram size example "Out" output should be
> ("fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore") - rather
> than ordering by gram size, output is now ordered first by position and then
> by gram size.
>
> 21. Pg 88: Stray asterisk in the first occurrence only example of the Pattern
> Replace Filter: *replace="first".
>
> 22. Pg 89: "encoder" argument to the Phonetic Filter has surrounding double
> curly brackets instead of being styled with a fixed-width font.
>
> 23. Pg 90: It should be mentioned on Porter Stem Filter that it's *four times
> faster* than the English Snowball stemmer - I benchmarked it at
> <http://markmail.org/thread/d2c443z63z37rwf6>
>
> 24. Pg 90: The Position Filter Factory is deprecated and will be removed in
> 5.0 - this should be mentioned.
>
> 25. Pg 90: The Position Filter Factory example has the wrong token position
> on the second token - it should be 2 instead of 3.
>
> 26. Pg 90: The "testsyns.txt" file contents are missing from Remove
> Duplicates Token Filter.
>
> 27. Pg 92: Shingle Filter is missing params "minShingleSize",
> "outputUnigramsIfNoShingles", and "tokenSeparator".
>
> 28. Pg 93: Standard Filter: as of Lucene match version 3.1, this filter is a
> no-op.
>
> 29. Pg 94: Stop Filter: the "enablePositionIncrements" arg is no longer
> supported as of Lucene/Solr 4.4 - this should be mentioned, and the example
> showing its use should be removed. All of the examples need to have their
> positions adjusted accordingly. Also, all language-specific examples later
> in the guide should have this arg removed.
>
> 30. Pg 97: Word Delimiter Filter: "-hotspot" is crossed out - the leading
> hyphen needs to be escaped or something.
>
> 31. Pg 97: WDF: Missing period+space in the "splitOnCaseChange" arg
> description: "XL"Example 1
>
> 32. Pg 97: WDF: "though" -> "through" in "protected" arg description.
>
> 33. Pg 98: CharFilterFactories: weird wording in "Char Filters can add,
> change, or remove characters without worrying about fault of Token offsets."
> - better: "Char Filters can add, change, or remove characters while
> preserving original character offsets to support e.g. highlighting."
>
> 34. Pg 99&100: Under solr.HTMLStripCharFilterFactory, the links labeled
> "Major Changes from Solr 3 to Solr 4." go one page previous to the start of
> this section in the guide.
>
> 35. Pg 100: solr.HTMLStripCharFilterFactory: this is incorrect: "Inline tags,
> such as <b>, <i>, or <span> will be replaced by a space." It should be:
> "Inline tags, such as <b>, <i>, or <span> will be removed - no space or
> newline will be substituted."
>
> 36. Pg 100: solr.PatternReplaceCharFilterFactory: All of the "replaceWith"
> column contents are missing backslashes; some have commas that shouldn't be
> there; and some have curly brackets that shouldn't be there.
>
> 37. Pg 101: Dictionary Compound Word Token Filter: the content of
> "germanwords.txt" ("dummkopfdonaudampfschiff") is missing spaces or newlines
> between words - it should be "dumm kopf donau dampf schiff" instead.
>
> 38. Pg 102: Under "Unicode Collation", s/that also be used/that also *can* be
> used/ in "Unicode Collation is a language-sensitive method of sorting text
> that also be used for advanced search purposes."
>
> 39. Pg 102&103: Under "Sorting Text for a Specific Language", in the sentence
> "You can see a list of supported Locales _here_", the link is to a list of
> supported locales under Java 5. The equivalent Java 6 link is
> <http://www.oracle.com/technetwork/java/javase/locales-137662.html>.
> Similarly, the Collator javadocs link in the sentence "For more information,
> see the _Collator javadocs_" is to the Java 5 javadocs - the
> equivalent Java 6 link is
> <http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html>.
> Similarly, under "Sorting Text with Custom Rules", the RuleBasedCollator
> javadocs link in the sentence "For more information, see the
> _RuleBasedCollator javadocs_" is to the Java 5 javadocs - the equivalent Java
> 6 link is
> <http://docs.oracle.com/javase/6/docs/api/java/text/RuleBasedCollator.html>.
>
> 40. Pg 102-105: Under Unicode Collation: (ICU)CollationFilterFactory have
> been deprecated (and will be removed in 5.0) in favor of (ICU)CollationField,
> which will need descriptions and examples.
>
> 41. Pg 105: Under Collation Key Filter, several city names in the result
> example are missing characters with diacritics: "Białystok" is missing its
> "ł", "Łowicz" is missing its "Ł", and "Świdnik" is missing its "Ś".
>
> 42. Pg 106: ISO Latin Accent Filter: this class is no longer present as of
> Solr 4.0 - this section should be replaced with one about ASCIIFoldingFilter.
> Also, the solr.MappingCharFilterFactory section on Pg 99 should be changed
> to use "mapping-FoldToASCII.txt" instead of "mapping-ISOLatin1Accent.txt".
>
> 43. Pg 106: Language-Specific Factories: Catalan, Danish, Irish and Romanian
> are missing from the covered languages; Catalan and Irish should include
> ElisionFilterFactory in their examples - there are articles lists in Lucene's
> {Catalan,Irish}Analyzer.
>
> 44. Pg 107-120: Example analyzers for the following languages don't include a
> <tokenizer> - they should include StandardTokenizer: Arabic, Bulgarian,
> Czech, Galician, Hindi, Indonesian, Italian, Persian, Polish, Swedish,
> Spanish, and Turkish.
>
> 45. Pg 109-112: The Dutch, Finnish and German examples all include a stray
> trailing space in their <tokenizer> class names.
>
> 46. Pg 110: Elision Filter: used for other languages besides French (e.g.
> Catalan, Italian, and Irish); ElisionFilter class was moved from the
> o.a.l.analysis.fr package to o.a.l.analysis.util.
>
> 47. Pg 110: Elision Filter: "articles" arg is not required (defaults to
> FrenchAnalyzer.DEFAULT_ARTICLES)
>
> 48. Pg 110: Elision Filter: "ignoreCase" arg is missing.
>
> 49. Pg 113: Italian: an example using ElisionFilterFactory should be included
> - there is an articles list in Lucene's ItalianAnalyzer.
>
> 50. Pg 113: Kuromoji: ", as in the following example:" should be removed from
> the following sentence, since there is no following example: "You can also
> make discarding punctuation configurable in the JapaneseTokenizerFactory, by
> setting discardPunctuation to false (to show punctuation) or true (to discard
> punctuation), as in the following example:"
>
> 51. Pg 114: Lao, Myanmar, Khmer: these are no longer in analysis-extras.
> There should either be an example for these here, or a pointer to another
> ICUTokenizerFactory example elsewhere in the guide.
>
> 52. Pg 114-116: Norwegian: the Snowball stemmer isn't mentioned in the
> supported Norwegian stemmers list, but the two examples erroneously include
> the Snowball stemmer *along with another stemmer*!
>
> 53. Pg 117: Russian: Russian Letter Tokenizer is deprecated, and it no longer
> supports the "charset" arg.
>
> 54. Pg 117: Russian: Russian Lower Case Filter was removed in 4.0. It should
> be replaced by LowerCaseFilter in all examples.
>
> Steve
>
> On Sep 25, 2013, at 3:36 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
>
>>
>> Please vote to release the following artifacts as the Apache Solr Reference
>> Guide for 4.5...
>>
>> https://dist.apache.org/repos/dist/dev/lucene/solr/ref-guide/apache-solr-ref-guide-4.5-RC0/
>>
>> $ cat apache-solr-ref-guide-4.5-RC0/apache-solr-ref-guide-4.5.pdf.sha1
>> ee40215d30f264d663f723ea2196b72b8cc5effc apache-solr-ref-guide-4.5.pdf
>>
>> (When reviewing the PDF, please don't hesitate to point out any typos or
>> formatting glitches or any other problems of subject matter. Re-spinning a
>> new RC is trivial, so in my opinion the bar is very low in terms of what
>> things are worth fixing before release.)
>>
>>
>> -Hoss
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
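
For anyone checking the corrected outputs in items 6, 17 and 20 of the review, the "ordered first by position and then by gram size" behavior can be sketched in a few lines of Python. This is only an illustration of the ordering described in those items, not Lucene's implementation: the function names (ngrams, edge_ngrams, apply_to_tokens) are made up for this sketch, and the input tokens for items 17 and 20 are assumptions inferred from the expected outputs, since the review does not quote the guide's inputs.

```python
def ngrams(token, min_size, max_size):
    """Yield n-grams ordered by start position first, then by gram size."""
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(token):
                yield token[start:start + size]


def edge_ngrams(token, min_size, max_size):
    """Yield n-grams anchored at the start of the token, shortest first."""
    for size in range(min_size, min(max_size, len(token)) + 1):
        yield token[:size]


def apply_to_tokens(gram_fn, tokens, min_size, max_size):
    """Apply a gram function to each token in turn and concatenate the results."""
    out = []
    for token in tokens:
        out.extend(gram_fn(token, min_size, max_size))
    return out


# Item 6: 4-5 grams of "bicycle"
print(list(ngrams("bicycle", 4, 5)))

# Item 20: 3-5 grams per token (assumed input tokens: "four", "score")
print(apply_to_tokens(ngrams, ["four", "score"], 3, 5))

# Item 17: 4-6 edge grams per token (assumed input tokens; tokens shorter
# than the minimum gram size produce no output)
print(apply_to_tokens(edge_ngrams, ["four", "score", "and", "twenty"], 4, 6))
```

With those assumed inputs, the three lists come out in the same order as the corrected outputs given in items 6, 20 and 17 respectively.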