Re: Nutch codebase formatting
Hi Lewis,

>> whether we need a Nutch custom code style at all… why don’t we just use
>> some other existing style and then enforce it?

Enforcing: yes!

However, I would try hard to keep the changes to a reasonable minimum. For
example, if we change the indentation, almost every line of code is affected,
which means that
- "git annotate" becomes mostly useless (or more difficult to use because you
  need to look back past the reformatting commit)
- merges of open PRs, custom patches or modifications in custom repositories
  might get quite painful until the formatting is synchronized.

>> * google Java format [1] which offers a GitHub action for easy integration
>> into our CI process, or

+1
+ also available for IntelliJ and Eclipse
+ indentation stays the same
+/- about 25% of the code lines are changed (might be acceptable)

>> * superlinter [3] basically emerging as the industry OSS default, offers a
>> GitHub action and could also be configured to lint dockerfile, and other
>> artifacts. It can also be configured to use the google Java style as well…

+1 (with Google Java style)

> I’ll submit a PR for superlinter so everyone can see what it would look like.

Great! Thanks!

Best,
Sebastian

On 10/29/23 00:38, Lewis John McGibbney wrote:

Any thoughts on this, folks? I’ll submit a PR for superlinter so everyone can
see what it would look like.

lewismc

On 2023/10/23 19:28:45 lewis john mcgibbney wrote:

Hi dev@,

For the longest time the Nutch codebase has shipped with an
eclipse-codeformat.xml [0] file. Whilst this has been largely successful in
keeping the codebase uniform, it has not been integrated into continuous
integration (CI) and has therefore never really been enforced!

Whilst I’m a big fan of “if it ain’t broken don’t fix it”, I think we should
have some CI code formatting checks. Additionally, I really question whether we
need a Nutch custom code style at all… why don’t we just use some other
existing style and then enforce it?

I therefore propose that we replace the legacy code formatter with a
convention such as

* google Java format [1] which offers a GitHub action for easy integration
  into our CI process, or
* Checkstyle [2] which offers an Ant task which we could use; this is of less
  utility as we think about the move to Gradle
* superlinter [3] basically emerging as the industry OSS default, offers a
  GitHub action and could also be configured to lint dockerfile, and other
  artifacts. It can also be configured to use the google Java style as well…

My preference would be [3] because it offers a more comprehensive linting
package for the entire codebase, not just the Java code.

Thanks for your consideration.

lewismc

[0] https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml
[1] https://github.com/google/google-java-format
[2] https://checkstyle.sourceforge.io/
[3] https://github.com/marketplace/actions/super-linter
[jira] [Commented] (NUTCH-3014) Standardize Job names
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17780720#comment-17780720 ]

ASF GitHub Bot commented on NUTCH-3014:
---

sebastian-nagel commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979

## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
     @Override
     protected int process(String line, StringBuilder output) throws Exception {
-      Job job = NutchJob.getInstance(getConf());
+      Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);

Review Comment:
`this` isn't really required here.

> Standardize Job names
> ---
>
> Key: NUTCH-3014
> URL: https://issues.apache.org/jira/browse/NUTCH-3014
> Project: Nutch
> Issue Type: Improvement
> Components: configuration, runtime
> Affects Versions: 1.19
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.20
>
>
> There is a large degree of variability when we set the job name:
>
> {{Job job = NutchJob.getInstance(getConf());}}
> {{job.setJobName("read " + segment);}}
>
> Some examples mention the job name, others don't. Some use upper case, others
> don't, etc.
> I think we can standardize the NutchJob job names. This would help when
> filtering jobs in the YARN ResourceManager UI as well.
> I propose we implement the following convention:
> * *Nutch* (mandatory) - static value which is prepended to the job name; it
> helps to distinguish the Job as a NutchJob and makes it easy to find.
> * *${ClassName}* (mandatory) - literally the name of the Class the job is
> encoded in
> * *${additional info}* (optional) - value could further distinguish the type
> of job (LinkRank Counter, LinkRank Initializer, LinkRank Inverter, etc.)
> _*Nutch ${ClassName}*: *${additional info}*_
> _Examples:_
> * _Nutch LinkRank: Inverter_
> * _Nutch CrawlDb: + $crawldb_
> * _Nutch LinkDbReader: + $linkdb_
> Thanks for any suggestions/comments.
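
The naming convention proposed in NUTCH-3014 can be illustrated with a small helper. This is only a sketch under stated assumptions: the class `JobNames` and its method `of` are hypothetical names invented for this illustration and are not part of Nutch or of PR #789. It builds "Nutch ${ClassName}" and appends ": ${additional info}" only when the optional part is present.

```java
/**
 * Hypothetical helper (not part of Nutch) that assembles job names
 * following the convention "Nutch ${ClassName}: ${additional info}".
 */
public final class JobNames {

  /** Static prefix that marks the job as a Nutch job. */
  private static final String PREFIX = "Nutch";

  private JobNames() {
    // utility class, no instances
  }

  /** Uses the simple class name as the mandatory ${ClassName} part. */
  public static String of(Class<?> jobClass, String additionalInfo) {
    return of(jobClass.getSimpleName(), additionalInfo);
  }

  /** Builds the name; additionalInfo may be null or blank (it is optional). */
  public static String of(String className, String additionalInfo) {
    if (className == null || className.trim().isEmpty()) {
      throw new IllegalArgumentException("class name is mandatory");
    }
    StringBuilder name =
        new StringBuilder(PREFIX).append(' ').append(className.trim());
    if (additionalInfo != null && !additionalInfo.trim().isEmpty()) {
      name.append(": ").append(additionalInfo.trim());
    }
    return name.toString();
  }

  public static void main(String[] args) {
    System.out.println(of("LinkRank", "Inverter"));
    System.out.println(of("CrawlDbReader", "process crawl/crawldb"));
    System.out.println(of(JobNames.class, null));
  }
}
```

With such a helper, the line under review could pass the generated string as the second argument of `Job.getInstance(getConf(), ...)`, the call already used in the diff above, so that every job is named the same way and can be filtered by the "Nutch" prefix in the YARN ResourceManager UI.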
Re: [PR] NUTCH-3014 Standardize Job names [nutch]
sebastian-nagel commented on code in PR #789:
URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979

## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ##
@@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)
     @Override
     protected int process(String line, StringBuilder output) throws Exception {
-      Job job = NutchJob.getInstance(getConf());
+      Job job = Job.getInstance(getConf(), "Nutch CrawlDbReader: process " + this.crawlDb);

Review Comment:
`this` isn't really required here.