Except for #1/#34 - internal links to beginning-of-page sections point one page 
earlier than they should - and #8/#41 - missing Thai and Polish chars - which I 
don't know how to fix, I'll try to address the other items on this (um, very 
long) list of mostly minor stuff I found:

0. All examples in the exported PDF have an extra blank line at the top.  I was 
able to eliminate these from this page 
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604227> 
("What is an analyzer?") by eliminating the newline between the initial {code 
…} line and the first line of the examples.  This doesn't have any apparent 
effect on the layout of the page on the wiki, but the PDF export of that page 
no longer has the extra blank lines.  Any objections to switching all {code} 
examples in the guide like this?

1. Pg 2: The section links from the TOC all take you to the previous page, 
rather than to the top of the page where the section starts.  (Same behavior on 
OS X Preview, and under Windows, on Firefox's built-in PDF viewer and on Adobe 
Reader.)  This looks like a general problem - see e.g. #34.

2. Pg 68: Stray asterisks in the <analyzer> tags in the <fieldType> example 
under "Analysis Phases", apparently to make the surrounded text bold (which 
also didn't happen).

3. Pg 69: The solr.KeywordTokenizerFactory example is missing one quotation 
mark from each of the left and right hand sides.

4. Pg 70: Under "solr.TokenizerFactory", there is a bogus "StandardTokenizer" 
link in the sentence "There aren't any filters that use StandardTokenizer's 
types" - the link is to the non-existent "StandardTokenizer" page on the Solr 
wiki.  (It might be useful to systematically link stuff like this to the 
corresponding Lucene or Solr javadocs, but this should probably be templated or 
scripted, so that the version-specific links are handled properly.)

5. Pg 71: Under "Standard Tokenizer", the email address recognition claim is 
false, and Internet domain name recognition isn't validation per se, e.g. 
"google.supercomputername" will be tokenized as a single token along with 
"google.com".  The "Out" example output needs fixup accordingly.  I see that 
the "Classic Tokenizer" section on pg 72 has the verbatim email/domain text; 
for ClassicTokenizer, the email claim is true, but it has the same issue with 
internet domain names as StandardTokenizer.

6. Pg 74: The NGram Tokenizer example output should be ("bicy", "bicyc", 
"icyc", "icycl", "cycl", "cycle", "ycle") instead of all of the 4grams before 
the 5grams (I think this class's behavior was changed in 4.4 by LUCENE-5042).

7. Pg 75: The ICU tokenizer "rulefiles" argument is missing.

8. Pg 75: The ICU Tokenizer's "In" input and "Out" output are completely 
missing the Thai text that's visible on the wiki.

9. Pg 75: Missing spaces in the Regular Expression Pattern Tokenizer's "group" 
attribute description, at the boundaries between the first two sentences: 
"token(s).The" and "tokens.Non-negative".

10. Pg 72, 76, 77, etc.: Many analysis components' factory class names should 
be styled with a fixed-width font.

11. Pg 77: UAX29 URL Email Tokenizer recognizes not only .com Internet domain 
names, but also domain names including any other valid top-level domain (i.e., 
unlike StandardTokenizer and ClassicTokenizer, domain names are validated 
against the white list drawn from the IANA Root Zone database 
<http://www.internic.net/zones/root.zone> as of the last time "ant gen-tld" was 
performed and the tokenizer was generated.)

12. Pg 77: UAX29 tokenizer: "file:://" should be "file://"

13. Pg 77: UAX29 tokenizer's <URL> and <EMAIL> type names are missing angle 
brackets.

14. Pg 77: UAX29 tokenizer's maxTokenLength attribute name should be styled 
with a fixed-width font.

15. Pg 78: In the example demonstrating how arguments can be given to <filter> 
elements via attributes, there is a stray asterisk, apparently intended to bold 
the surrounding text, which also didn't work: *min="2" max="7"/>

16. Pg 79: The ASCII Folding Filter's "Out" output should have the accent 
stripped ("á" -> "a") and the ASCII character value adjusted to match (ASCII 
character 97).

17. Pg 81: The Edge N-gram Filter's 4-6 gram size example "Out" should be 
("four", "scor", "score", "twen", "twent", "twenty") - some of these are 
missing.

18. Pg 83: The ICU Normalizer 2 Filter example should include the "name" and 
"mode" attributes in the <filter> element.

19. Pg 87: Stray asterisks in both of the N-Gram Filter examples: 
*minGramSize="...

20. Pg 87: The N-Gram Filter 3-5 gram size example "Out" output should be 
("fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore") - rather 
than ordering by gram size, output is now ordered first by position and then by 
gram size.

21. Pg 88: Stray asterisk in the first occurrence only example of the Pattern 
Replace Filter: *replace="first".

22. Pg 89: "encoder" argument to the Phonetic Filter has surrounding double 
curly brackets instead of being styled with a fixed-width font. 

23. Pg 90: It should be mentioned on Porter Stem Filter that it's *four times 
faster* than the English Snowball stemmer - I benchmarked it at 
<http://markmail.org/thread/d2c443z63z37rwf6>

24. Pg 90: The Position Filter Factory is deprecated and will be removed in 5.0 
- this should be mentioned.

25. Pg 90: The Position Filter Factory example has the wrong token position on 
the second token - it should be 2 instead of 3.

26. Pg 90: The "testsyns.txt" file contents are missing from Remove Duplicates 
Token Filter.

27. Pg 92: Shingle Filter is missing params "minShingleSize", 
"outputUnigramsIfNoShingles", and "tokenSeparator".

28. Pg 93: Standard Filter: as of Lucene match version 3.1, this filter is a 
no-op.

29. Pg 94: Stop Filter: the "enablePositionIncrements" arg is no longer 
supported as of Lucene/Solr 4.4 - this should be mentioned, and the example 
showing its use should be removed.  All of the examples need to have their 
positions adjusted accordingly.  Also, all language-specific examples later in 
the guide should have this arg removed.

30. Pg 97: Word Delimiter Filter: "-hotspot" is crossed out - the leading 
hyphen needs to be escaped or something.

31. Pg 97: WDF: Missing period+space in the "splitOnCaseChange" arg 
description: "XL"Example 1 

32. Pg 97: WDF: "though" -> "through" in "protected" arg description.

33. Pg 98: CharFilterFactories: weird wording in "Char Filters can add, change, 
or remove characters without worrying about fault of Token offsets." - better: 
"Char Filters can add, change, or remove characters while preserving original 
character offsets to support e.g. highlighting."

34. Pg 99&100: Under solr.HTMLStripCharFilterFactory, the links labeled "Major 
Changes from Solr 3 to Solr 4." go one page previous to the start of this 
section in the guide.

35. Pg 100: solr.HTMLStripCharFilterFactory: this is incorrect: "Inline tags, 
such as <b>, <i>, or <span> will be replaced by a space."  It should be: 
"Inline tags, such as <b>, <i>, or <span> will be removed - no space or newline 
will be substituted."

36. Pg 100: solr.PatternReplaceCharFilterFactory: All of the "replaceWith" 
column contents are missing backslashes; some have commas that shouldn't be 
there; and some have curly brackets that shouldn't be there.

37. Pg 101: Dictionary Compound Word Token Filter: the content of 
"germanwords.txt" ("dummkopfdonaudampfschiff") is missing spaces or newlines 
between words - it should be "dumm kopf donau dampf schiff" instead.

38. Pg 102: Under "Unicode Collation", s/that also be used/that also *can* be 
used/ in "Unicode Collation is a language-sensitive method of sorting text that 
also be used for advanced search purposes."

39. Pg 102&103: Under "Sorting Text for a Specific Language", in the sentence 
"You can see a list of supported Locales _here_", the link is to a list of 
supported locales under Java 5.  The equivalent Java 6 link is 
<http://www.oracle.com/technetwork/java/javase/locales-137662.html>.  
Similarly, the Collator javadocs link in the sentence "For more information, 
see the _Collator javadocs_" is to the Java 5 javadocs - the 
equivalent Java 6 link is 
<http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html>.  Similarly, 
under "Sorting Text with Custom Rules", the RuleBasedCollator javadocs link in 
the sentence "For more information, see the _RuleBasedCollator javadocs_" is to 
the Java 5 javadocs - the equivalent Java 6 link is 
<http://docs.oracle.com/javase/6/docs/api/java/text/RuleBasedCollator.html>.

40. Pg 102-105: Under Unicode Collation: (ICU)CollationFilterFactory have been 
deprecated (and will be removed in 5.0) in favor of (ICU)CollationField, which 
will need descriptions and examples.

41. Pg 105: Under Collation Key Filter, several city names in the result 
example are missing characters with diacritics: "Białystok" is missing its "ł", 
"Łowicz" is missing its "Ł", and "Świdnik" is missing its "Ś".

42. Pg 106: ISO Latin Accent Filter: this class is no longer present as of Solr 
4.0 - this section should be replaced with one about ASCIIFoldingFilter.  Also, 
the solr.MappingCharFilterFactory section on Pg 99 should be changed to use 
"mapping-FoldToASCII.txt" instead of "mapping-ISOLatin1Accent.txt".

43. Pg 106: Language-Specific Factories: Catalan, Danish, Irish and Romanian 
are missing from the covered languages; Catalan and Irish should include 
ElisionFilterFactory in their examples - there are articles lists in Lucene's 
{Catalan,Irish}Analyzer.

44. Pg 107-120: Example analyzers for the following languages don't include a 
<tokenizer> - they should include StandardTokenizer: Arabic, Bulgarian, Czech, 
Galician, Hindi, Indonesian, Italian, Persian, Polish, Swedish, Spanish, and 
Turkish.

45. Pg 109-112: The Dutch, Finnish and German examples all include a stray 
trailing space in their <tokenizer> class names.

46. Pg 110: Elision Filter: used for other languages besides French (e.g. 
Catalan, Italian, and Irish); ElisionFilter class was moved from the 
o.a.l.analysis.fr package to o.a.l.analysis.util.

47. Pg 110: Elision Filter: "articles" arg is not required (defaults to 
FrenchAnalyzer.DEFAULT_ARTICLES)

48. Pg 110: Elision Filter: "ignoreCase" arg is missing. 

49. Pg 113: Italian: an example using ElisionFilterFactory should be included - 
there is an articles list in Lucene's ItalianAnalyzer.

50. Pg 113: Kuromoji: ", as in the following example:" should be removed from 
the following sentence, since there is no following example: "You can also make 
discarding punctuation configurable in the JapaneseTokenizerFactory, by setting 
discardPunctuation to false (to show punctuation) or true (to discard 
punctuation), as in the following example:"

51. Pg 114: Lao, Myanmar, Khmer: these are no longer in analysis-extras.  There 
should either be an example for these here, or a pointer to another 
ICUTokenizerFactory example elsewhere in the guide.

52. Pg 114-116: Norwegian: the Snowball stemmer isn't mentioned in the 
supported Norwegian stemmers list, but the two examples erroneously include the 
Snowball stemmer *along with another stemmer*!

53. Pg 117: Russian: Russian Letter Tokenizer is deprecated, and it no longer 
supports the "charset" arg.

54. Pg 117: Russian: Russian Lower Case Filter was removed in 4.0.  It should 
be replaced by LowerCaseFilter in all examples.
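
Re item 0: if nobody objects, the switch could probably be scripted rather 
than done by hand.  A minimal sketch, assuming the page source is Confluence 
wiki markup, that opening tags look like {code} or {code:...}, and that 
opening and closing tags strictly alternate (the function name is made up for 
illustration, and I haven't run this against the real guide):

```python
import re

# Matches a code macro tag such as {code} or {code:xml|title=x}.
# Assumption: this is the tag syntax used in the page source.
CODE_TAG = re.compile(r"\{code(?::[^}]*)?\}")


def join_code_openers(markup):
    """Remove the single line break that directly follows each opening tag."""
    out = []
    pos = 0
    opening = True  # assumption: tags alternate open, close, open, ...
    for match in CODE_TAG.finditer(markup):
        out.append(markup[pos:match.end()])
        pos = match.end()
        if opening:
            # Skip exactly one line break (CRLF or LF) after an opening tag.
            if markup.startswith("\r\n", pos):
                pos += 2
            elif markup.startswith("\n", pos):
                pos += 1
        opening = not opening
    out.append(markup[pos:])
    return "".join(out)


page = "Intro\n{code:xml}\n<analyzer>\n</analyzer>\n{code}\nOutro\n"
print(join_code_openers(page))
```

Closing tags are left alone, so only the blank line at the top of each 
example should be affected.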
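
Re items 6, 17 and 20: to make the expected orderings concrete, here is a 
small sketch that reproduces the outputs listed above.  It is not Lucene's 
implementation, just the "position first, then gram size" ordering written 
out; the function names are made up, and the inputs "four score" and "four 
score and twenty" are my guess at what the guide's examples use.

```python
def ngrams(text, min_size, max_size):
    """Return n-grams ordered by start position, then by gram size."""
    grams = []
    for start in range(len(text)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def edge_ngrams(token, min_size, max_size):
    """Return n-grams anchored at the start of the token, shortest first."""
    longest = min(max_size, len(token))
    return [token[:size] for size in range(min_size, longest + 1)]


# Item 6: 4-5 grams over the whole input.
print(ngrams("bicycle", 4, 5))
# ['bicy', 'bicyc', 'icyc', 'icycl', 'cycl', 'cycle', 'ycle']

# Item 20: 3-5 grams applied to each whitespace-separated token.
print([g for t in "four score".split() for g in ngrams(t, 3, 5)])
# ['fou', 'four', 'our', 'sco', 'scor', 'score', 'cor', 'core', 'ore']

# Item 17: 4-6 edge grams; tokens shorter than 4 chars produce nothing.
print([g for t in "four score and twenty".split() for g in edge_ngrams(t, 4, 6)])
# ['four', 'scor', 'score', 'twen', 'twent', 'twenty']
```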
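
Re item 16: a quick way to double-check the corrected character value.  This 
uses Unicode decomposition as a stand-in for the folding step; it is not how 
ASCIIFoldingFilter works internally (that class uses its own mapping table), 
and the function name is made up for illustration.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded = strip_accents("\u00e1")  # U+00E1 is the precomposed "a with acute"
print(folded, ord(folded))  # a 97
```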

Steve

On Sep 25, 2013, at 3:36 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> 
> Please vote to release the following artifacts as the Apache Solr Reference 
> Guide for 4.5...
> 
> https://dist.apache.org/repos/dist/dev/lucene/solr/ref-guide/apache-solr-ref-guide-4.5-RC0/
> 
> $ cat apache-solr-ref-guide-4.5-RC0/apache-solr-ref-guide-4.5.pdf.sha1
> ee40215d30f264d663f723ea2196b72b8cc5effc  apache-solr-ref-guide-4.5.pdf
> 
> (When reviewing the PDF, please don't hesitate to point out any typos or 
> formatting glitches or any other problems of subject matter. Re-spinning a 
> new RC is trivial, so in my opinion the bar is very low in terms of what 
> things are worth fixing before release.)
> 
> 
> 
> 
> 
> -Hoss
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 

