[ https://issues.apache.org/jira/browse/TIKA-4730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087661#comment-18087661 ]

Tim Allison commented on TIKA-4730:
-----------------------------------

*Better*
| Flip | Files | Delta common tokens | OOV better/worse | Avg lang delta |
  | `UTF-16LE → windows-1252` | 704 | +162,118 | 652 / 18 | +6.80 |
  | `GB18030 → UTF-8` | 994 | +146,271 | 895 / 68 | +0.27 |
  | `ISO-8859-1 → windows-1250` | 337 | +104,281 | 313 / 8 | +0.15 |
  | `x-MacCyrillic → windows-1256` | 179 | +81,733 | 178 / 1 | +7.58 |
  | `UTF-16LE → UTF-8` | 117 | +70,920 | 108 / 0 | +15.0 |
  | `windows-1252 → ISO-8859-2` | 455 | +51,978 | 276 / 21 | −0.05 |
  | `windows-1252 → UTF-8` | 2,841 | +44,035 | 1,162 / 493 | +0.17 |
  | `windows-1252 → windows-1250` | 757 | +32,487 | 615 / 15 | ~0 |
  | `windows-1255 → UTF-8` | 90 | +31,576 | 67 / 3 | +10.1 |
  | `windows-1256 → UTF-8` | 66 | +23,851 | 64 / 0 | +9.6 |
  | `IBM424 → windows-1252` | 39 | +17,427 | 36 / 3 | −0.47 |
  | `EUC-JP → UTF-8` | 281 | +17,957 | 245 / 11 | +0.96 |
  | `EUC-KR → UTF-8` | 275 | +16,715 | 252 / 2 | +0.65 |
  | `UTF-16LE → windows-1254` | 15 | +15,048 | 15 / 0 | +11.2 |
  | `ISO-8859-1 → UTF-8` | 1,796 | +14,254 | 685 / 447 | −0.32 |
  | `ISO-8859-1 → ISO-8859-2` | 132 | +13,885 | 89 / 11 | +0.04 |
  | `windows-1251 → UTF-8` | 235 | +11,920 | 165 / 0 | +2.85 |
  | `x-MacCyrillic → windows-1251` | 280 | +11,451 | 194 / 6 | +1.50 |
  | `windows-1252 → windows-1254` | 392 | +12,301 | 172 / 5 | +0.03 |
  | `windows-1254 → UTF-8` | 200 | +8,447 | 185 / 8 | +1.41 |

*Worse*
| Flip | Files | Delta common tokens | OOV better/worse | Avg lang delta |
  | `Big5 → windows-1252` | 28 | −63,409 | 13 / 14 | −2.05 |
  | `Shift_JIS → x-MacRoman` | 19 | −31,656 | 0 / 4 | −0.27 |
  | `ISO-8859-1 → GB18030` | 6,655 | −15,907 | 179 / 366 | ~0 |
  | `windows-1252 → x-MacRoman` | 392 | −9,251 | 53 / 291 | ~0 |
  | `UTF-8 → windows-1252` | 135 | −8,592 | 31 / 63 | −2.59 |
  | `windows-1250 → windows-1252` | 162 | −6,196 | 30 / 62 | −0.53 |
  | `windows-1251 → windows-1252` | 189 | −5,451 | 48 / 108 | −1.05 |
  | `ISO-8859-1 → x-MacRoman` | 348 | −5,030 | 73 / 250 | +0.45 |
  | `windows-1252 → IBM850` | 867 | −3,889 | 36 / 816 | ~0 |
  | `windows-1252 → IBM852` | 59 | −3,822 | 10 / 40 | −0.22 |
  | `windows-1252 → GB18030` | 789 | −3,706 | 147 / 393 | −0.10 |

> Prep for 4.0.0-beta-1 release
> -----------------------------
>
>                 Key: TIKA-4730
>                 URL: https://issues.apache.org/jira/browse/TIKA-4730
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: reports-2020609.tgz, reports.tar.gz
>
>
> We made a number of important fixes to the published artifacts in ASF's dist
> repo, Maven Central and Docker.
> I think we're set on changing APIs for 4.x generally.
> Is there anything else we need for this beta release?
> I propose starting the 4.0.0-beta-1 release in two weeks. WDYT?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
