[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908768#action_12908768 ]

Doğacan Güney commented on NUTCH-893:
-------------------------------------

I want to close this one as INVALID and continue work on NUTCH-879. Any objections?

DataStore.put() silently loses records when executed from multiple processes
----------------------------------------------------------------------------

Key: NUTCH-893
URL: https://issues.apache.org/jira/browse/NUTCH-893
Project: Nutch
Issue Type: Bug
Affects Versions: 2.0
Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6
Reporter: Andrzej Bialecki
Priority: Blocker
Fix For: 2.0
Attachments: NUTCH-893.patch, NUTCH-893_v2.patch

In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs.

There are two tests there: one that uses multiple threads within the same JVM, and another that uses single thread in multiple JVMs. Each test first clears webtable (be careful!), and then puts a bunch of pages, and finally counts that all are present and their values correspond to keys. To make things more interesting each execution context (thread or process) closes and reopens its instance of DataStore a few times.

The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
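
For readers without the attachment at hand, the multithreaded variant of the test described in the issue can be sketched roughly as below. This is not the code from NUTCH-893.patch and it does not use the Gora API: InMemoryStore is a hypothetical stand-in for a DataStore that buffers writes until close(), and the key and value formats, worker counts and class names are all assumptions made for illustration. It only shows the shape of the test (clear the table, have several clients put pages while closing and reopening their store, then check that every key is present with the matching value).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; none of these names come from Gora or the patch.
class MultiClientPutSketch {

    // Hypothetical stand-in for a DataStore instance backed by a shared table.
    static class InMemoryStore {
        // Plays the role of webtable, shared by all store instances.
        private static final Map<String, String> TABLE = new ConcurrentHashMap<>();
        // Per-instance write buffer, written out only on close().
        private final Map<String, String> buffer = new HashMap<>();

        void put(String key, String value) { buffer.put(key, value); }

        void close() { TABLE.putAll(buffer); buffer.clear(); }

        static void clear() { TABLE.clear(); }

        static String get(String key) { return TABLE.get(key); }

        static int size() { return TABLE.size(); }
    }

    // Runs the test with one thread per worker and returns the number of
    // keys that are missing or hold the wrong value afterwards.
    static int countMissing(int workers, int keysPerWorker, int reopenEvery) {
        InMemoryStore.clear(); // each run starts from an empty table
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                InMemoryStore store = new InMemoryStore();
                for (int i = 0; i < keysPerWorker; i++) {
                    String key = "key-" + id + "-" + i;
                    store.put(key, "value-of-" + key);
                    // Close and reopen the store a few times per worker.
                    if ((i + 1) % reopenEvery == 0) {
                        store.close();
                        store = new InMemoryStore();
                    }
                }
                store.close();
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        // Verify that every key is present and its value corresponds to the key.
        int missing = 0;
        for (int w = 0; w < workers; w++) {
            for (int i = 0; i < keysPerWorker; i++) {
                String key = "key-" + w + "-" + i;
                if (!("value-of-" + key).equals(InMemoryStore.get(key))) {
                    missing++;
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("missing keys: " + countMissing(4, 250, 50));
    }
}
```

The multi-process variant would run the same worker loop in separate JVMs against a real shared backend, which an in-memory stand-in like this one cannot reproduce.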
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908791#action_12908791 ]

Andrzej Bialecki commented on NUTCH-893:
----------------------------------------

+1 and +1.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907297#action_12907297 ]

Andrzej Bialecki commented on NUTCH-893:
----------------------------------------

Very good catch - yes, the test now passes for me too. This is actually good news for Gora :)

I'll continue digging regarding NUTCH-879 ... don't hesitate if you have any ideas how to solve that. I suspect we may be losing keys in Generator or Fetcher, due to partitioning collisions but this hypothesis needs to be tested.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904226#action_12904226 ]

Andrzej Bialecki commented on NUTCH-893:
----------------------------------------

Dogacan, flush() doesn't help - there are still missing keys. What's interesting is that the missing keys form sequential ranges. Could this be perhaps an issue with connection management, or some synchronization issue?
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904316#action_12904316 ]

Doğacan Güney commented on NUTCH-893:
-------------------------------------

The code already calls close() so if flush() doesn't help, then yeah, this sounds like an issue with connection management or synchronization. I'll test what happens if we change SqlStore logic to not buffer statements at all, instead directly execute them.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902893#action_12902893 ]

Doğacan Güney commented on NUTCH-893:
-------------------------------------

I'll go over this issue more carefully. But, in the meantime, did you try this test by adding DataStore#flush? Does it change anything?